Tibbles

Task 1:Loading the tidyverse package.

library(tidyverse)

Task 2:Converting the iris dataset to a tibble.

as_tibble(iris)

Task 3: Creating a tibble with columns “x,” “y,” and “z,” where “x” ranges from 1 to 5, “y” is 1 for all rows, and “z” is calculated as the square of “x” plus “y” for each row.

tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)

Task 4:Creating a tibble with columns named “:)” (representing “smile”), ” ” (representing “space”), and “2000” (representing “number”).

tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb

Task 5:Creating a tibble with columns “x,” “y,” and “z,” containing the values “a,” 2, 3.6 and “b,” 1, 8.5 respectively.

tribble(
  ~x, ~y, ~z,
  
  "a", 2, 3.6,
  "b", 1, 8.5
)

Tibbles vs. data.frame

Task-1:Creating a tibble with columns “a,” “b,” “c,” “d,” and “e,” containing 1000 randomly generated values for each column, representing dates, numbers, and letters.

tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)

Task 2: Tnstalling the package

package_to_install <- c("nycflights13")

for (package_name in package_to_install) {
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)
  }
}
library(nycflights13)

Task 3: Printing the first 10 rows of the nycflights13::flights dataset with unlimited width.

nycflights13::flights %>% 
  print(n = 10, width = Inf)

Task 4: Viewing the nycflights13::flights dataset in a separate window for interactive exploration.

nycflights13::flights %>% 
  View()

Subsetting

Task 1: Creating a tibble named “df” with columns “x” and “y,” then accessing the “x” column using different methods:

df <- tibble(
  x = runif(5),#function that generates random numbers from a uniform distribution
  y = rnorm(5) # function that generates random numbers from a normal (Gaussian) distribution
)

df$x
[1] 0.4134781 0.3841133 0.5761670 0.6047906 0.8490257
df[["x"]]
[1] 0.4134781 0.3841133 0.5761670 0.6047906 0.8490257
df[[1]]
[1] 0.4134781 0.3841133 0.5761670 0.6047906 0.8490257
df %>% .$x
[1] 0.4134781 0.3841133 0.5761670 0.6047906 0.8490257

Interacting with older code

Task-1: Determining the class of the object “tb” after converting it to a data frame.

class(as.data.frame(tb))
[1] "data.frame"

Exercises

Task-1: How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).

mtcars

Task-2

# In a data.frame, extracting a non-existent column returns NULL,
# whereas in a tibble, it raises an error, providing immediate feedback.
# Other operations, such as extracting existing columns and subsets of columns,
# behave similarly across both data frames and tibbles.
# The default behavior of data.frames may lead to frustration
# due to the lack of error feedback for non-existent columns,
# potentially causing unnoticed mistakes and difficulty in debugging.
# In contrast, tibbles offer more robust behavior, enhancing data integrity
# and debugging efficiency.

df <- data.frame(abc = 1, xyz = "a")

# Extracting non-existent column in a data.frame
df$x  # Returns NULL
[1] "a"
# Extracting existing column in a data.frame
df[, "xyz"]  # Returns a data frame with one column containing the values of the "xyz" column
[1] "a"
# Extracting multiple columns in a data.frame
df[, c("abc", "xyz")]  # Returns a data frame containing only the specified columns
NA

Task-3:If you have the name of a variable stored in an object, e.g. var <- “mpg”, how can you extract the reference variable from a tibble?

No pacakages

# heights <- read_csv("data/heights.csv")

Task 1: listing several tables: table1, table2, table3, table4a, and table4b.

table1
table2
table3
table4a
table4b

Task 2: Calculating the rate by dividing the number of cases by the population and then multiplying by 10,000 for table1.

table1 %>% 
  mutate(rate = cases / population * 10000)

Task 3: Counting the occurrences of each year in table1, using the ‘cases’ column as the weight.

table1 %>% 
  count(year, wt = cases)

Task 4: Creating a ggplot using table1, plotting ‘year’ against ‘cases’ with lines grouped by ‘country’ and colored in grey50, along with points colored by ‘country’.

library(ggplot2)
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))

Pivoting

Longer

Task-1: referring to ‘table4a’

table4a

Task-2: Reshaping table4a using pivot_longer for columns ‘1999’ and ‘2000’ into ‘year’ and ‘cases’.

table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

Task-3: Reshaping table4b with pivot_longer for columns ‘1999’ and ‘2000’ into ‘year’ and ‘population’.

table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")  #function transforms wide data into long format by stacking multiple columns into two: one for variable names and one for their corresponding values

Task-4: creating tidy datasets tidy4a and tidy4b by using pivot_longer on table4a and table4b to reshape them. Then, performing a left join on tidy4a and tidy4b.

tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)
Joining with `by = join_by(country, year)`

Wider

Task-1:Displaying table 2

table2

Task-2: using the pivot_wider function on table2 to transform it from long to wide format, with ‘type’ becoming the new column names and ‘count’ being the corresponding values.

table2 %>%
    pivot_wider(names_from = type, values_from = count)

Separating and uniting

Separate

Task-1:displaying table3

 table3

Task-2: Using the separate function on table3 splits the ‘rate’ column into two separate columns named ‘cases’ and ‘population’.

table3 %>% 
  separate(rate, into = c("cases", "population"))

Task-3:Using the separate function on table3 splits the ‘rate’ column into two separate columns named ‘cases’ and ‘population’, using the ‘/’ character as the separator.

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")

Task-4:Using the separate function on table3 splits the ‘rate’ column into two separate columns named ‘cases’ and ‘population’, converting the resulting columns to their appropriate data types.

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)

Task-5: Applying the separate function to table3, the ‘year’ column is divided into two separate columns labeled ‘century’ and ‘year’, with the separator defined as the second character.

table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)

Unite

Task-1: The unite function is applied to table5 to merge the ‘century’ and ‘year’ columns into a single column named ‘new’.

table5 %>% 
  unite(new, century, year)

Task-2: unite function is applied to table5 to merge the ‘century’ and ‘year’ columns into a single column named ‘new’, with no separator between them.

table5 %>% 
  unite(new, century, year, sep = "")

Missing values

Task-1: Create a tibble named “stocks” with columns “year”, “qtr” (quarter), and “return”, having data for 2015 and 2016, with quarterly returns specified and some missing entries as NA.

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

Task-2:Pivoting the “stocks” tibble to widen the data, extracting columns from the “year” variable and values from the “return” variable.

stocks %>% 
  pivot_wider(names_from = year, values_from = return)

Task-3: pivot the data to a wide format with columns for each year’s returns, then reshape it back to a long format, keeping only the non-missing values in the “return” column.

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(
    cols = c(`2015`, `2016`), 
    names_to = "year", 
    values_to = "return", 
    values_drop_na = TRUE
  )

Task-4:Filling missing combinations of “year” and “qtr” in the “stocks” dataset.

stocks %>% 
  complete(year, qtr)

Task-5:Creating a tibble named “treatment” containing information about individuals, their treatment groups, and their responses, with some missing values for the “person” column.

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

Task-6: Filling the missing values in the “person” column of the “treatment” tibble.

treatment %>% 
  fill(person)
NA

Case Study

Task-1: Loading data set

who

Task-2:Pivoting the “who” dataset from wide to long format, condensing columns into “cases” and capturing the original column names in “key”.

who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1

Task-3:Counting the occurrences of each “key” in the “who1” dataset.

  who1 %>% 
    count(key)

Task-4:Replacing “newrel” with “new_rel” in the “key” column of the “who1” dataset to create “who2.”

who2 <- who1 %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2

Task-5:Separating the “key” column in the “who2” dataset into “new,” “type,” and “sexage” columns using “_” as the separator to create “who3.”

who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep = "_")
who3

Task-6:Counting the occurrences of each unique value in the “new” column of the “who3” dataset.

who3 %>% 
  count(new)

Task-7:Removing the “new”, “iso2”, and “iso3” columns from the “who3” dataset and assigning the result to “who4”.

who4 <- who3 %>% 
  select(-new, -iso2, -iso3)

Task-8:Splitting the “sexage” column of the “who4” dataset into “sex” and “age” columns, separated by the first character, and assigning the result to “who5”.

who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who5

Task-9:Transforming the “who” dataset from wide to long format, adjusting column names, extracting meaningful variables, dropping unnecessary columns, and splitting the “sexage” column into “sex” and “age”.

who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
NA

CH-13: Relational data

Task-1:Loding the libraries

library(tidyverse)
library(nycflights13)

nycflights13

Task-1: airlines data

airlines

Task-2: airports data

airports

Task-3: planes data

planes 

Task-4: weather data

weather 

Keys

Task-1Counting the occurrences of each tail number in the “planes” table and filtering for those with more than one occurrence.

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)

Task-2:Counting the occurrences of each combination of year, month, day, hour, and origin in the “weather” table and filtering for those with more than one occurrence.

weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)

Task-3:Counting the occurrences of each combination of year, month, day, and flight in the “flights” table and filtering for those with more than one occurrence.

flights %>% 
  count(year, month, day, flight) %>% 
  filter(n > 1)

Task-4:Counting the occurrences of each combination of year, month, day, and tail number in the “flights” table and filtering for those with more than one occurrence.

flights %>% 
  count(year, month, day, tailnum) %>% 
  filter(n > 1)

Mutating joins

Task-1: Creating a subset of the “flights” table named “flights2” containing columns from “year” to “day”, “hour”, “origin”, “dest”, “tailnum”, and “carrier”.

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2

Task-2:Removing the “origin” and “dest” columns from “flights2” table and then performing a left join with the “airlines” table, using the “carrier” column as the key for matching.

flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier")

Task-3:Shortening the command by removing “selecting” and directly “mutating” the “name” column with the corresponding airline names from the “airlines” table based on the “carrier” column.

flights2 %>%
  select(-origin, -dest) %>% 
  mutate(name = airlines$name[match(carrier, airlines$carrier)])

Understanding joins

Task-1:Creating two tibbles, “x” and “y”, each with a “key” column and an associated “val_x” or “val_y” column, respectively.

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

x
y

Inner join

Task-1:Joining tibbles x and y using an inner join operation based on the “key” column.

x %>% 
  inner_join(y, by = "key")

Duplicate keys

Task-1: Joining tibble x with tibble y using the common column “key”.

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)

Task-2:Performing a left join between tibble x and tibble y based on the common column “key”.

left_join(x, y, by = "key")

Task-3:Creating two tibbles, x and y, with columns “key”, “val_x”, and “val_y”, populated with corresponding values.

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)

Task-4:Performing a left join on tibbles x and y using the “key” column as the join key.

left_join(x, y, by = "key")
Warning: Detected an unexpected many-to-many relationship between `x` and `y`.

Defining the key columns

Task-1:Performing a left join between the flights2 tibble and the weather tibble.

flights2 %>% 
  left_join(weather)
Joining with `by = join_by(year, month, day, hour, origin)`

Task-2:Performing a left join between the flights2 tibble and the planes tibble using the “tailnum” column as the key.

flights2 %>% 
  left_join(planes, by = "tailnum")

Task-3:Performing a left join between the flights2 tibble and the airports tibble, matching the “dest” column from flights2 with the “faa” column from airports.

flights2 %>% 
  left_join(airports, c("dest" = "faa"))

Task-4:Performing a left join between the flights2 tibble and the airports tibble, matching the “origin” column from flights2 with the “faa” column from airports.

flights2 %>% 
  left_join(airports, c("origin" = "faa"))

Filtering joins

Task-1: Calculating the top 10 destinations by counting the occurrences in the “dest” column of the flights tibble, sorted in descending order, and then displaying the result.

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest

Task-2: Filtering the flights tibble to include only rows where the destination (dest) matches any of the top 10 destinations identified in the previous step.

flights %>% 
  filter(dest %in% top_dest$dest)
#%in% operator in R is used to check if elements in one vector are present in another vector

Task-3: Selecting rows from the flights dataset where the destination airport matches one of the top 10 destinations previously identified.

flights %>% 
  semi_join(top_dest)
Joining with `by = join_by(dest)`

Task-4: Filtering out flights with tail numbers present in the planes dataset and counting the occurrences of each unique tail number, sorting the result.

flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)

Set operations

Task-1:creating two tibbles, df1 and df2, each with columns x and y, containing sample data.

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

Task-2:performing set operations on the tibbles df1 and df2, including intersection, union, and set differences.

intersect(df1, df2)
union(df1, df2)
setdiff(df1, df2)
setdiff(df2, df1)

CH-14: Strings

Basic Info:string1 <- “This is a string” string2 <- ‘If I want to include a “quote” inside a string, I use single quotes’

Task-1:To include a literal single or double quote in a string you can use  to “escape” it

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

Task-2: Understanding the character


x <- c("\"", "\\") #backslash is escape character
x
[1] "\"" "\\"
writeLines(x)
"
\

String length

Task-1:

str_length(c("a", "R for data science", NA))
[1]  1 18 NA

Combining strings

Task-1:Combining the strings

str_c("x", "y")
[1] "xy"
str_c("x", "y", "z")
[1] "xyz"

Task-2:Using the sep argument to control how they’re separated.

str_c("x", "y", sep = ", ")
[1] "x, y"

Task-3:Performing concatenation with “|” and “-” at both ends of each element of vector x, and replacing NA values with empty strings before concatenation.

x <- c("abc", NA)
str_c("|-", x, "-|")
[1] "|-abc-|" NA       
str_c("|-", str_replace_na(x), "-|")
[1] "|-abc-|" "|-NA-|" 

Task-4: concatenating each element of the vector c(“a”, “b”, “c”) with a prefix “prefix-” and a suffix “-suffix”.

str_c("prefix-", c("a", "b", "c"), "-suffix")
[1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

Task-5: combining strings

name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE

str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
[1] "Good morning Hadley."

Subsetting strings

Task-1:Extracting the first three characters from each element in the vector x using str_sub.

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
[1] "App" "Ban" "Pea"

Task-2:negative numbers count backwards from end

str_sub(x, -3, -1)
[1] "ple" "ana" "ear"

Task-3:using the assignment form of str_sub() to modify strings

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
[1] "apple"  "banana" "pear"  

Locales

Task-1:Changing the case

str_to_upper(c("i", "ı"))
[1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
[1] "İ" "I"

Task-2:Sorting the character vector x alphabetically using the English (en) locale and the Hawaiian (haw) locale.

x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en") 
[1] "apple"    "banana"   "eggplant"
str_sort(x, locale = "haw") 
[1] "apple"    "eggplant" "banana"  

Matching patterns with regular expressions

Basic matches

Task-1:Searching for the pattern “an” within each element of x and displaying the matches.

x <- c("apple", "banana", "pear")
str_view(x, "an")
[2] │ b<an><an>a

Task-2:Displaying elements in x where any character is followed by “a” and then any character.

str_view(x, ".a.")
[2] │ <ban>ana
[3] │ p<ear>

Task-3

# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)
\.
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
[2] │ <a.c>

Task-4: Displaying elements in x where the sequence “\” occurs.

x <- "a\\b"
writeLines(x)
a\b
str_view(x, "\\\\")
[1] │ a<\>b

Anchors

Task-1: Displaying elements in x that start with “a” and end with “a” respectively.

x <- c("apple", "banana", "pear")
str_view(x, "^a")
[1] │ <a>pple
str_view(x, "a$")
[2] │ banan<a>

Task-2: Highlighting “apple” occurrences in x and instances where it’s the only content.

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
[1] │ <apple> pie
[2] │ <apple>
[3] │ <apple> cake
str_view(x, "^apple$")
[2] │ <apple>

Character classes and alternatives

Task-1: Visualizing patterns matching “a.c”, “a*c”, and “a c” in the provided character vector.

str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
[2] │ <a.c>
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
[3] │ <a*c>
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
[4] │ <a >c

Task-2: Visualizing patterns matching “grey” or “gray” in the provided character vector.

str_view(c("grey", "gray"), "gr(e|a)y")
[1] │ <grey>
[2] │ <gray>

Repetition

Task-1:Identifying patterns “CC” or “C” in the string “1888 is the longest year in Roman numerals

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
[1] │ 1888 is the longest year in Roman numerals: MD<CC><C>LXXXVIII

Task-2: Viewing the pattern “CC”

str_view(x, "CC+")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII

Task-3: Viewing the pattern “C[LX]+”

str_view(x, 'C[LX]+')
[1] │ 1888 is the longest year in Roman numerals: MDCC<CLXXX>VIII

Task-4:Viewing the pattern “C{2},C{2,},c{2,3}”

str_view(x, "C{2}")
[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII
str_view(x, "C{2,}")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
str_view(x, "C{2,3}")
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII

Grouping and backreferences

Task-1:Grouping

str_view(fruit, "(..)\\1", match = TRUE)
 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry

Detect matches

Task-1: Checking for the presence of the letter “e” in each word

x <- c("apple", "banana", "pear")
str_detect(x, "e")
[1]  TRUE FALSE  TRUE

Task-2:Checking how many common words start with t

sum(str_detect(words, "^t"))
[1] 65

Task-3: Checking proportion of common words end with a vowel

mean(str_detect(words, "[aeiou]$"))
[1] 0.2765306

Task-4:Finding all words containing at least one vowel, and negate

no_vowels_1 <- !str_detect(words, "[aeiou]")

Task-5:Finding all words consisting only of consonants (non-vowels)

no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
[1] TRUE

Task-6: Filtering words that end with the letter “x” from a list of words.

words[str_detect(words, "x$")]
[1] "box" "sex" "six" "tax"
str_subset(words, "x$")
[1] "box" "sex" "six" "tax"

Task-7: Filtering a tibble for words that end with “x”.

df <- tibble(
  word = words, 
  i = seq_along(word)
)
df %>% 
  filter(str_detect(word, "x$"))

Task-8:Counting the occurrences of “a” in each element of a character vector.

x <- c("apple", "banana", "pear")
str_count(x, "a")
[1] 1 3 1

Task-9: Seeing average of how many vowels per word

mean(str_count(words, "[aeiou]"))
[1] 1.991837

Task-10: Adding columns to a tibble to count vowels and consonants in each word.

df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )

Task-11:Counting “aba” occurrences in “abababa” and showing all “aba” instances.

str_count("abababa", "aba")
[1] 2
str_view_all("abababa", "aba")
Warning: `str_view_all()` was deprecated in stringr 1.5.0.
Please use `str_view()` instead.
[1] │ <aba>b<aba>

Extract matches

Task-1: Displaying the length of sentences and showing the first few sentences.

length(sentences)
[1] 720
head(sentences)
[1] "The birch canoe slid on the smooth planks."  "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."      "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."        "The juice of lemons makes fine punch."      

Task-2: Creating a string pattern to match colors by concatenating them with a pipe delimiter.

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
[1] "red|orange|yellow|green|blue|purple"

Task-3: Filter sentences for colors and extract matches, showing the first few.

has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
[1] "blue" "blue" "red"  "red"  "red"  "blue"

Task-4:Showing all sentences containing multiple colors and highlight the matches.

more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
[1] │ It is hard to erase <blue> or <red> ink.
[2] │ The <green> light in the brown box flicke<red>.
[3] │ The sky in the west is tinged with <orange> <red>.

Task-5:Extracting all color matches from the subset of sentences containing multiple colors.

str_extract(more, colour_match)
[1] "blue"   "green"  "orange"

Task-6:Extracting all occurrences of colors from the subset of sentences containing multiple colors.

str_extract_all(more, colour_match)
[[1]]
[1] "blue" "red" 

[[2]]
[1] "green" "red"  

[[3]]
[1] "orange" "red"   

Task-7: Extracting colors from sentences with multiple colors and simplify, also extract lowercase letters from each element in x and simplify.

str_extract_all(more, colour_match, simplify = TRUE)
     [,1]     [,2] 
[1,] "blue"   "red"
[2,] "green"  "red"
[3,] "orange" "red"
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
     [,1] [,2] [,3]
[1,] "a"  ""   ""  
[2,] "a"  "b"  ""  
[3,] "a"  "b"  "c" 

Grouped matches

Task-1: Extracting sentences containing nouns defined by a pattern, then extracts the nouns from those sentences.

noun <- "(a|the) ([^ ]+)"

has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>% 
  str_extract(noun)
 [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked" "the sun"    "the huge"   "the ball"  
 [9] "the woman"  "a helps"   

Task-2:

has_noun %>% 
  str_match(noun)
      [,1]         [,2]  [,3]     
 [1,] "the smooth" "the" "smooth" 
 [2,] "the sheet"  "the" "sheet"  
 [3,] "the depth"  "the" "depth"  
 [4,] "a chicken"  "a"   "chicken"
 [5,] "the parked" "the" "parked" 
 [6,] "the sun"    "the" "sun"    
 [7,] "the huge"   "the" "huge"   
 [8,] "the ball"   "the" "ball"   
 [9,] "the woman"  "the" "woman"  
[10,] "a helps"    "a"   "helps"  

Task-3:Creating a tibble with columns ‘article’ and ‘noun’ extracted from sentences based on a pattern.

tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE
  )

Replacing matches

Task-1: Replacing the first vowel in each word of x with a hyphen. Replacing all vowels in each word of x with a hyphen.

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
[1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
[1] "-ppl-"  "p--r"   "b-n-n-"

Task-2: Replacing numeric values in x with their corresponding word representations.

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
[1] "one house"    "two cars"     "three people"

Task-3:Reordering words in sentences by swapping the second and third word positions.

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)
[1] "The canoe birch slid on the smooth planks."  "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."      "These a days chicken leg is a rare dish."   
[5] "Rice often is served in round bowls."       

Splitting

Task-1: Splitting the first five sentences into words.

sentences %>%
  head(5) %>% 
  str_split(" ")
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth"  "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"         "dark"        "blue"       
[8] "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

[[4]]
[1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"       "rare"    "dish."  

[[5]]
[1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

Task-2:Splitting the string ‘a|b|c|d’ by ‘|’ into a vector of elements.

"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]]
[1] "a" "b" "c" "d"

Task-3:Splitting the first 5 sentences by space into a matrix of words.

sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
     [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]          [,9]   
[1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."     ""     
[2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background." ""     
[3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"           "well."
[4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"        "dish."
[5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""            ""     

Task-4:Splitting each field string into two parts at the first occurrence of ‘:’.

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
     [,1]      [,2]    
[1,] "Name"    "Hadley"
[2,] "Country" "NZ"    
[3,] "Age"     "35"    

Task-5: Display word boundaries, split by spaces, and split by word boundaries, respectively.

x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))
[1] │ <This> <is> <a> <sentence>.  <This> <is> <another> <sentence>.
str_split(x, " ")[[1]]
[1] "This"      "is"        "a"         "sentence." ""          "This"      "is"        "another"  
[9] "sentence."
str_split(x, boundary("word"))[[1]]
[1] "This"     "is"       "a"        "sentence" "This"     "is"       "another"  "sentence"

Other types of pattern

Task-1:

# The regular call:
str_view(fruit, "nana")
[4] │ ba<nana>
# Is shorthand for
str_view(fruit, regex("nana"))
[4] │ ba<nana>

Task-2:Visualizing occurrences of “banana” in different case variations.

bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
[1] │ <banana>
str_view(bananas, regex("banana", ignore_case = TRUE))
[1] │ <banana>
[2] │ <Banana>
[3] │ <BANANA>

Task-3: Extracting all lines starting with “Line” from the text.

x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
[1] "Line"

Task-4: Extracting all occurrences of lines starting with “Line” from the text, considering each line separately.

str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
[1] "Line" "Line" "Line"

Task-5:Creating a regular expression pattern for phone numbers, allowing for variations in formatting, and attempting to match it against the provided phone number.

phone <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [) -]?   # optional closing parens, space, or dash
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{3}) # three more numbers
  ", comments = TRUE)

str_match("514-791-8141", phone)
     [,1]          [,2]  [,3]  [,4] 
[1,] "514-791-814" "514" "791" "814"

Task-6:Installling the package and Benchmarking string detection in “sentences” using fixed and regex patterns 20 times each, comparing performance with microbenchmark.


package_to_install <- c("microbenchmark")

for (package_name in package_to_install) {
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)
  }
}
library(microbenchmark)

microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times = 20
  )
Unit: microseconds

Task-7:Starting with a1 being “0e1” and a2 being “a301”, both representing the character “á”, they are compared for equality.

a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
[1] "á" "á"
a1 == a2
[1] FALSE

Task-8: Checking if a1 contains the fixed string a2 returns FALSE, whereas using collation rules returns TRUE.

str_detect(a1, fixed(a2))
[1] FALSE
str_detect(a1, coll(a2))
[1] TRUE

Task-9:Creating a vector i with different forms of the letter “i”, then using str_subset to filter them based on collation.

i <- c("I", "İ", "i", "ı")
i
[1] "I" "İ" "i" "ı"
str_subset(i, coll("i", ignore_case = TRUE))
[1] "I" "i"
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
[1] "İ" "i"

Task-10: Fetching locale information.

stringi::stri_locale_info()
$Language
[1] "en"

$Country
[1] "US"

$Variant
[1] ""

$Name
[1] "en_US"

Task-11:Visualizing word boundaries and extracts all words from the string.

x <- "This is a sentence."
str_view_all(x, boundary("word"))
[1] │ <This> <is> <a> <sentence>.
str_extract_all(x, boundary("word"))
[[1]]
[1] "This"     "is"       "a"        "sentence"

CH-15: Factors

Creatig factors

Task-1:Adding character vector in variable x1

x1 <- c("Dec", "Apr", "Jan", "Mar")

Task-2:Adding character vector in variable x2

x2 <- c("Dec", "Apr", "Jam", "Mar")

Task-3:Sorting X1

sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"

Task-4:Adding Character vector in month_levels

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Task-5:Assigning the factor levels to the variable x1, using the predefined month_levels.

y1 <- factor(x1, levels = month_levels)
y

Task-6:Sorting the factor levels in y1.

sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Task-7:creating a factor y2 from x2 with custom levels specified by month_levels.

y2 <- factor(x2, levels = month_levels)
y2
[1] Dec  Apr  <NA> Mar 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Task-8:parsing the values in x2 as factors

y2 <- parse_factor(x2, levels = month_levels)
Warning: 1 parsing failure.
row col           expected actual
  3  -- value in level set    Jam

Task-9: omitting the levels.

factor(x1)
[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar

Task-10:Creating a factor f1 from the values in x1, using the unique values of x1 as levels.

f1 <- factor(x1, levels = unique(x1))
f1
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar

Task-11: creating a factor f2 from the values in x1, ordering them according to their appearance in x1.

f2 <- x1 %>% factor() %>% fct_inorder()
f2
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar

Task-12:Omitting levels2

levels(f2)
[1] "Dec" "Apr" "Jan" "Mar"

General Social Survey

Task-1:Loading datasets

gss_cat

Task-2:Seeing levels through count()

gss_cat %>%
  count(race)

Task-3:Also seeing through bar()

ggplot(gss_cat, aes(race)) +
  geom_bar()

Task-4:Generating a bar plot using ggplot()

ggplot(gss_cat,aes(race))+geom_bar()+scale_x_discrete(drop=FALSE)

Modifying factor order

Task-1:calculating summary statistics and then creating scatter plot

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

Task-2:Generating a scatter plot using ggplot, where the x-axis represents the mean TV hours (tvhours), and the y-axis represents the relig variable reordered by mean TV hours.

ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

Task-3:Creating a scatter plot using ggplot.

relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()

Task-4:Generating a scatter plot using ggplot

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()

Task-5: creates a scatter plot of the average age by income level, with “Not applicable” as the reference level for income

ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()

Task-6:calculating the proportion of each marital status group across different age groups and creates a line plot showing the distribution of marital status proportions by age.

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)


ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")

Task-7: Adjusting the order of the “marital” variable based on frequency and then reverses the order before generating a bar plot illustrating the distribution of marital status.

gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
    geom_bar()

Modifying factor levels

Task-1: counting the frequency of each unique value in the “partyid” variable of the “gss_cat” dataset.

gss_cat%>%count(partyid)

Task-2:Recording the levels of the “partyid” variable in the “gss_cat” dataset and then counts the frequency of each unique recorded value.

gss_cat %>%
  mutate( partyid=fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
    ))%>%
  count(partyid)

Task-3:Recategorizing and counting party affiliations in the “gss_cat” dataset.

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)

Task-4: Collapsing categories within the “partyid” variable in the “gss_cat” dataset into broader groups and then counting the frequency of each collapsed category.

gss_cat%>%
  mutate(partyid=fct_collapse(partyid,
                              other=c("No answer", "Don't know", "Other party"),
                              rep=c("Strong republican", "Not str republican"),
                              ind=c("Ind,near rep", "Independent", "Ind,near dem"),
                              dem=c("Not str democrat", "Strong democrat"))) %>%
  count(partyid)

Task-5:Counting and aggregating religious affiliations in the “gss_cat” dataset after lumping together less frequent categories.

gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)

Task-6:“Summarizing religious affiliations after lumping infrequent categories and sort.”

gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)

CH-Data and Times

Task-1:Loading library

library(tidyverse)

library(lubridate)
library(nycflights13)

Creating dates/times

Task-1: Printing current date or date-time

today()
[1] "2024-05-04"
now()
[1] "2024-05-04 20:51:16 +0545"

Form strings

Task-2:converting date strings to date objects in different formats.

ymd("2017-01-31")
[1] "2017-01-31"
mdy("January 31st, 2017")
[1] "2017-01-31"
dmy("31-Jan-2017")
[1] "2017-01-31"
ymd(20170131)
[1] "2017-01-31"
ymd_hms("2017-01-31 20:11:59")
[1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
[1] "2017-01-31 08:01:00 UTC"
flights %>% 
  select(year, month, day, hour, minute)
flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))

Task: Creating date-time objects from hour-minute time data in the ‘flights’ dataset and filtering out rows with missing departure or arrival times

make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt

Task: Plotting the frequency of flights over time using departure date-time

flights_dt %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) 

Task: Plotting the frequency of flights over time for a specific period using departure date-time

flights_dt %>% 
  filter(dep_time < ymd(20130102)) %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes

Task: to convert today’s date to date-time object

as_datetime(today())
[1] "2024-05-04 UTC"
as_date(now())
[1] "2024-05-04"
as_date(365 * 10 + 2)
[1] "1980-01-01"

Date-time components Task: Extracting various components of a date-time object

datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
[1] 2016
month(datetime)
[1] 7
mday(datetime)
[1] 8
yday(datetime)
[1] 190
wday(datetime)
[1] 6
month(datetime, label = TRUE)
[1] Jul
Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
wday(datetime, label = TRUE, abbr = FALSE)
[1] Friday
Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday

Task: Plotting the frequency of flights by day of the week

flights_dt %>% 
  mutate(wday = wday(dep_time, label = TRUE)) %>% 
  ggplot(aes(x = wday)) +
    geom_bar()

Task: Plotting average delay by minute of departure time

flights_dt %>% 
  mutate(minute = minute(dep_time)) %>% 
  group_by(minute) %>% 
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n()) %>% 
  ggplot(aes(minute, avg_delay)) +
    geom_line()

Task: Plotting average delay by minute of scheduled departure time

sched_dep <- flights_dt %>% 
  mutate(minute = minute(sched_dep_time)) %>% 
  group_by(minute) %>% 
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n())

ggplot(sched_dep, aes(minute, avg_delay)) +
  geom_line()

Task: Plotting the number of flights by minute of scheduled departure time

ggplot(sched_dep, aes(minute, n)) +
  geom_line()

Rounding Task:Plotting the number of flights by week, rounding to the nearest week

flights_dt %>% 
  count(week = floor_date(dep_time, "week")) %>% 
  ggplot(aes(week, n)) +
    geom_line()

setting compounds Task: Setting up a date-time object

(datetime <- ymd_hms("2016-07-08 12:34:56"))
[1] "2016-07-08 12:34:56 UTC"
year(datetime) <- 2020
datetime
[1] "2020-07-08 12:34:56 UTC"
month(datetime) <- 01
datetime
[1] "2020-01-08 12:34:56 UTC"
hour(datetime) <- hour(datetime) + 1
datetime
[1] "2020-01-08 13:34:56 UTC"
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
[1] "2020-02-02 02:34:56 UTC"
ymd("2015-02-01") %>% 
  update(mday = 30)
[1] "2015-03-02"
ymd("2015-02-01") %>% 
  update(hour = 400)
[1] "2015-02-17 16:00:00 UTC"

Task: Creating a new variable ‘dep_hour’ by updating the ‘dep_time’ to the first day of the year

flights_dt %>% 
  mutate(dep_hour = update(dep_time, yday = 1)) %>% 
  ggplot(aes(dep_hour)) +
    geom_freqpoly(binwidth = 300)

Time Spans Compute the age of a person based on their birthdate and today’s date

h_age <- today() - ymd(19791014)
h_age
Time difference of 16274 days
as.duration(h_age)
[1] "1406073600s (~44.56 years)"
dseconds(15)
[1] "15s"
dminutes(10)
[1] "600s (~10 minutes)"
dhours(c(12, 24))
[1] "43200s (~12 hours)" "86400s (~1 days)"  
ddays(0:5)
[1] "0s"                "86400s (~1 days)"  "172800s (~2 days)" "259200s (~3 days)" "345600s (~4 days)"
[6] "432000s (~5 days)"
dweeks(3)
[1] "1814400s (~3 weeks)"
dyears(1)
[1] "31557600s (~1 years)"
2 * dyears(1)
[1] "63115200s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
[1] "38869200s (~1.23 years)"
tomorrow <- today() + ddays(1)
last_year <- today() - dyears(1)
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
one_pm
[1] "2016-03-12 13:00:00 EST"
one_pm + ddays(1)
[1] "2016-03-13 14:00:00 EDT"

Periods Create period objects representing different time spans and Perform arithmetic operations with period objects

one_pm
[1] "2016-03-12 13:00:00 EST"
one_om = days(1)
seconds(15)
[1] "15S"
minutes(10)
[1] "10M 0S"
hours(c(12, 24))
[1] "12H 0M 0S" "24H 0M 0S"
days(7)
[1] "7d 0H 0M 0S"
months(1:6)
[1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S" "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
weeks(3)
[1] "21d 0H 0M 0S"
years(1)
[1] "1y 0m 0d 0H 0M 0S"
10 * (months(6) + days(1))
[1] "60m 10d 0H 0M 0S"
days(50) + hours(25) + minutes(2)
[1] "50d 25H 2M 0S"
ymd("2016-01-01") + dyears(1)
[1] "2016-12-31 06:00:00 UTC"
ymd("2016-01-01") + years(1)
[1] "2017-01-01"
one_pm + ddays(1)
[1] "2016-03-13 14:00:00 EDT"
one_pm + days(1)
[1] "2016-03-13 13:00:00 EDT"

Filter flights where arrival time is before departure time

flights_dt %>% 
  filter(arr_time < dep_time) 

Update flights data to correct overnight flights

flights_dt <- flights_dt %>% 
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )

Filter flights where overnight condition is true and arrival time is before departure time

flights_dt %>% 
  filter(overnight, arr_time < dep_time) 

Intervals Calculate the ratio of one year in days

years(1) / days(1)
[1] 365.25
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
[1] 365
(today() %--% next_year) %/% days(1)
[1] 365

Display time zone information

Sys.timezone()
[1] "Asia/Katmandu"
length(OlsonNames())
[1] 596
head(OlsonNames())
[1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa" "Africa/Algiers"     "Africa/Asmara"     
[6] "Africa/Asmera"     
(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
[1] "2015-06-01 12:00:00 EDT"
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
[1] "2015-06-01 18:00:00 CEST"
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
[1] "2015-06-02 04:00:00 NZST"
x1 - x2
Time difference of 0 secs
x1 - x3
Time difference of 0 secs

Pipes

Task: To import the required library

packages_to_install <- c("tidyverse", "pryr")
for (package_name in packages_to_install) {
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)
  }
  library(package_name, character.only = TRUE)
}

library(magrittr)

Create diamond data and calculate the object sizes

diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% 
  dplyr::mutate(price_per_carat = price / carat)

pryr::object_size(diamonds)
3.46 MB
pryr::object_size(diamonds2)
3.89 MB
pryr::object_size(diamonds, diamonds2)
3.89 MB

Functions Normalize the columns of a data frame

df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
  (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

Normalize a single column of a data frame

(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
 [1] 0.2660918 0.1288832 0.0769690 0.3163641 0.5612945 0.6241704 0.5271891 0.0000000 0.3913369 1.0000000
x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
 [1] 0.2660918 0.1288832 0.0769690 0.3163641 0.5612945 0.6241704 0.5271891 0.0000000 0.3913369 1.0000000
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
 [1] 0.2660918 0.1288832 0.0769690 0.3163641 0.5612945 0.6241704 0.5271891 0.0000000 0.3913369 1.0000000
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
[1] 0.0 0.5 1.0

Rescale a vector to the range [0, 1]

rescale01(c(-10, 0, 10))
[1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
[1] 0.00 0.25 0.50   NA 1.00

Rescale each column of a DataFrame to the range [0, 1]

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
x <- c(1:10, Inf)
rescale01(x)
 [1]   0   0   0   0   0   0   0   0   0   0 NaN

Define the rescale01 function and apply it

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
 [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667 0.7777778 0.8888889 1.0000000
[11]       Inf

Load required libraries and packages

library(tidyverse)
library(purrr)
library(magrittr)

# install.packages("pryr")
library(pryr)

18.2 Piping alternatives

This is a popular Children’s poem that is accompanied by hand actions.We’ll start by defining an object to represent little bunny Foo Foo:

# foo_foo <- little_bunny()

18.2.1 Intermediate steps

The simplest approach is to save each step as a new object:

# foo_foo_1 <- hop(foo_foo,through=forest)
# foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
# foo_foo_3 <- bop(foo_foo_2, on = head)

Create diamonds dataset and calculate price per carat

diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% 
  dplyr::mutate(price_per_carat=price/carat)

pryr::object_size(diamonds)
3.46 MB
pryr::object_size(diamonds2)
3.89 MB
pryr::object_size(diamonds,diamonds2)
3.89 MB

Introduce NA value into diamonds$carat and check object sizes

diamonds$carat[1] <- NA
pryr::object_size(diamonds)
3.46 MB
pryr::object_size(diamonds2)
3.89 MB
pryr::object_size(diamonds,diamonds2)
4.32 MB

18.2.2 Overwrite the original

Instead of creating intermediate objects at each step, we could overwrite the original object:

# foo_foo <- hop(foo_foo, through = forest)
# foo_foo <- scoop(foo_foo, up = field_mice)
# foo_foo <- bop(foo_foo, on = head)

18.2.3 Function composition

Another approach is to abandon assignment and just string the function calls together:

# bop(
#   scoop(
#     hop(foo_foo, through = forest),
#     up = field_mice
#   ), 
#   on = head
# )

Here the disadvantage is that you have to read from inside-out, from right-to-left, and that the arguments end up spread far apart (evocatively called the dagwood sandwhich problem). In short, this code is hard for a human to consume.

18.2.4 Use the pipe

Finally, we can use the pipe:

# foo_foo %>%
#   hop(through = forest) %>%
#   scoop(up = field_mice) %>%
#   bop(on = head)
# my_pipe <- function(.) {
#   . <- hop(., through = forest)
#   . <- scoop(., up = field_mice)
#   bop(., on = head)
# }
# my_pipe(foo_foo)

TASK: Functions that use the current environment. For example, assign() will create a new variable with the given name in the current environment:

assign("x",10)
x
[1] 10
"x" %>% assign(100)
x
[1] 10

Assign value to “x” in the specified environment and check its value and Generate random numbers, create a matrix, plot it, and inspect its structure

env <- environment()
"x" %>% assign(100,envir=env)
x
[1] 100
rnorm(100) %>% 
  matrix(ncol=2) %>% 
  plot() %>% 
  str()
 NULL

rnorm(100) %>% 
  matrix(ncol=2) %>% 
  plot() %>% 
  str()
 NULL

ndist <- rnorm(100000)
hist(ndist)

Calculate the correlation between two variables in mtcars dataset

mtcars %$%
  cor(disp, mpg)
[1] -0.8475514
  • For assignment magrittr provides the %<>% operator which allows you to replace code like:
mtcars <- mtcars %>% 
  transform(cyl=cyl*2)
mtcars %<>% transform(cyl=cyl*2)

Chapter 19 Functions

19.1 Introduction

19.2 When should you write a function?

df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df

df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
  (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

Rescale a single variable in a data frame

(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
 [1] 0.5235603 0.5627738 0.2777349 0.0000000 1.0000000 0.1019340 0.3832672 0.9030480 0.6526363 0.6825633

Rescale a single variable without creating a new object

x <- df$a
(x - min(x, na.rm = T)) / (max(x, na.rm = T)-min(x, na.rm = T))
 [1] 0.5235603 0.5627738 0.2777349 0.0000000 1.0000000 0.1019340 0.3832672 0.9030480 0.6526363 0.6825633

Task: There is some duplication in this code. We’re computing the range of the data three times, so it makes sense to do it in one step:

rng <- range(x, na.rm = T)
(x-rng[1])/(rng[2]-rng[1])
 [1] 0.5235603 0.5627738 0.2777349 0.0000000 1.0000000 0.1019340 0.3832672 0.9030480 0.6526363 0.6825633

Pulling out intermediate calculations into named variables is a good practice because it makes it more clear what the code is doing. Now that I’ve simplified the code, and checked that it still works, I can turn it into a function:

rescale01 <- function(x){
  rng <- range(x, na.rm = T)
  (x-rng[1])/(rng[2]-rng[1])
}
rescale01(c(0,5,10))
[1] 0.0 0.5 1.0

Test the rescale01 function with various inputs

rescale01(c(-10,0,10))
[1] 0.0 0.5 1.0
rescale01(c(1,2,3,NA,5))
[1] 0.00 0.25 0.50   NA 1.00

We can simplify the original example now that we have a function:

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

Rescale a vector with infinite values

x <- c(1:10,Inf)
rescale01(x)
 [1]   0   0   0   0   0   0   0   0   0   0 NaN

Because we’ve extracted the code into a function, we only need to make the fix in one place:

rescale01 <- function(x){
  rng <- range(x,na.rm=T,finite=T)
  (x-rng[1])/(rng[2]-rng[1])
}
rescale01(x)
 [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667 0.7777778 0.8888889 1.0000000
[11]       Inf

19.4 Conditional execution

An if statement allows you to conditionally execute code. It looks like this:

# if (condition) {
  # code executed when condition is TRUE
# } else {
  # code executed when condition is FALSE
# }

Define a function to check if an object has names

has_name <- function(x){
  nms <- names(x)
  if(is.null(nms)){
    rep(FALSE,length(x))
  }else {
    !is.na(nms) & nms !=""
  }
}

19.4.1 Conditions

how if condition works with warnings

# if (c(TRUE,FALSE)){}
#> Warning in if (c(TRUE, FALSE)) {: the condition has length > 1 and only the
#> first element will be used
#> NULL

# if (NA) {}

Check if two objects are identical

identical(0L,0)
[1] FALSE
x <- sqrt(2)^2
x==2
[1] FALSE
x-2
[1] 4.440892e-16

19.4.2 Multiple conditions

You can chain multiple if statement together:

# if (this) {
#   # do that
# } else if (that) {
#   # do something else
# } else {
#   # 
# }
#> function(x, y, op) {
#>   switch(op,
#>     plus = x + y,
#>     minus = x - y,
#>     times = x * y,
#>     divide = x / y,
#>     stop("Unknown op!")
#>   )
#> }

19.4.3 Code style

Good practice for writing if statements

# Good
# if (y < 0 && debug) {
#   message("Y is negative")
# }
# 
# if (y == 0) {
#   log(x)
# } else {
#   y ^ x
# }
# 
# # Bad
# if (y < 0 && debug)
# message("Y is negative")
# 
# if (y == 0) {
#   log(x)
# } 
# else {
#   y ^ x
# }

It’s ok to drop the curly braces if you have a very short if statement that can fit on one line:

y <- 10
x <- if (y < 20) "Too low" else "Too high"

I recommend this only for very brief if statements. Otherwise, the full form is easier to read:

if (y < 20) {
  x <- "Too low" 
} else {
  x <- "Too high"
}

19.5 Function arguments

# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

x <- runif(100)
mean_ci(x)
[1] 0.4370008 0.5485763
mean_ci(x, conf = 0.99)
[1] 0.4194710 0.5661061

19.5.1 Choosing names

19.5.2 Cheking values

wt_mean <- function(x, w) {
  sum(x * w) / sum(w)
}
wt_var <- function(x, w) {
  mu <- wt_mean(x, w)
  sum(w * (x - mu) ^ 2) / sum(w)
}
wt_sd <- function(x, w) {
  sqrt(wt_var(x, w))
}

What happens if x and w are not the same length?

wt_mean(1:6, 1:3)
[1] 7.666667

In this case, because of R’s vector recycling rules, we don’t get an error.

It’s good practice to check important preconditions, and throw an error (with stop()), if they are not true:

wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(w * x) / sum(w)
}
wt_mean <- function(x, w, na.rm = FALSE) {
  if (!is.logical(na.rm)) {
    stop("`na.rm` must be logical")
  }
  if (length(na.rm) != 1) {
    stop("`na.rm` must be length 1")
  }
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}

This is a lot of extra work for little additional gain. A useful compromise is the built-in stopifnot(): it checks that each argument is TRUE, and produces a generic error message if not.

wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}

19.5.3 Dot-dot-dot(…)

Many functions in R take an arbitrary number of inputs:

sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
[1] 55
stringr::str_c("a", "b", "c", "d", "e", "f")
[1] "abcdef"

Define a function to concatenate strings with commas

commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
[1] "a, b, c, d, e, f, g, h, i, j"
rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
Important output ----------------------------------------------------------------------------------------
x <- c(1,2)
sum(x,na.rm=T)
[1] 3

Define a function ‘complicated_function’ with conditions to return 0 if ‘x’ or ‘y’ is empty

complicated_function <- function(x,y,z){
  if (lenth(x)==0 || length(y)==0){
    return(0)
  }
}

Improve readability of if-else blocks by using early return for simple cases

f <- function() {
  if (x) {
    # Do 
    # something
    # that
    # takes
    # many
    # lines
    # to
    # express
  } else {
    # return something short
  }
}

But if the first block is very long, by the time you get to the else, you’ve forgotten the condition. One way to rewrite it is to use an early return for the simple case:

f <- function() {
  if (!x) {
    return(something_short)
  }

  # Do 
  # something
  # that
  # takes
  # many
  # lines
  # to
  # express
}

This tends to make the code easier to understand, because you don’t need quite so much context to understand it.

19.6.2 Writing pipeable functions

Define a function to show the count of missing values in a data frame

show_missing <- function(df){
  n <- sum(is.na(df))
  cat("Missing values:",n,"\n",sep="")
  
  invisible(df)
}

If we call it interatively, the invisible() means that the input df does not get printed out:

show_missing(mtcars)
Missing values:0

But it’s still there, it’s not just printed by default:

x <- show_missing(mtcars)
Missing values:0
class(x)
[1] "data.frame"
dim(x)
[1] 32 11

And we can still use it in a pipe:

library(magrittr)
library(tidyverse)

mtcars %>% 
  show_missing() %>% 
  mutate(mpg=ifelse(mpg<20,NA,mpg)) %>% 
  show_missing()
Missing values:0
Missing values:18

19.7 Environment

Define a function ‘f’ that takes an argument ‘x’ and returns the sum of ‘x’ and ‘y’

f <- function(x){
  x+y
}

Demonstrate how changing the value of ‘y’ affects the result of calling function ‘f’

y <- 100
f(10)
[1] 110
y <- 1000
f(10)
[1] 1010

Overload the ‘+’ operator to behave differently based on a random condition

`+` <- function(x, y) {
  if (runif(1) < 0.1) {
    sum(x, y)
  } else {
    sum(x, y) * 1.1
  }
}
table(replicate(1000, 1 + 2))

  3 3.3 
 94 906 
#> 
#>   3 3.3 
#> 100 900
rm(`+`)

Chapter 20: Vectors

20.1.1 PRerequisites

library(tidyverse)

20.2 Vector basics

Determine the data type of different vectors

typeof(letters)
[1] "character"
typeof(1:10)
[1] "integer"

Determine the length of a list and display its contents

x <- list("a","b",1:10)
length(x)
[1] 3
x
[[1]]
[1] "a"

[[2]]
[1] "b"

[[3]]
 [1]  1  2  3  4  5  6  7  8  9 10

Demonstrate modulo operation and creation of logical vectors

1:10 %% 3 ==0
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
c(T,T,F,NA)
[1]  TRUE  TRUE FALSE    NA

20.3.2 Numeric

Integer and double vectors are known collectively as numeric vectors. In R, numbers are doubles by default. To make an integer, place an L after the number:

typeof(1)
[1] "double"
typeof(1L)
[1] "integer"
1.5
[1] 1.5

Demonstrate the behavior of floating point arithmetic

x <- sqrt(2)^2
x
[1] 2
x-2
[1] 4.440892e-16

Demonstrate the behavior of division by zero

c(-1,0,1)%/% 0
[1] -Inf  NaN  Inf
# [1] -Inf  NaN  Inf

20.3.3 Character

Determine the memory size of a string and a replicated string vector

x <- "This is a reasonably long string."
pryr::object_size(x)
152 B
y <- rep(x,1000)
pryr::object_size(y)
8.14 kB

20.3.4 Missing values

Note that each type of atomic vector has its own missing value:

NA            # logical
[1] NA
NA_integer_   # integer
[1] NA
NA_real_      # double
[1] NA
NA_character_ # character
[1] NA

Calculate the number and proportion of elements in a vector greater than 10

x <- sample(20,100,replace=T)
y <- x > 10
sum(y) # how many are greater than 10?
[1] 49
mean(y) # what proportion are greater than 10?
[1] 0.49
if (length(x)){
}
NULL

Determine the data type of different vectors

typeof(c(TRUE,1L))
[1] "integer"
typeof(c(1L,1.5))
[1] "double"
typeof(c(1.5,"a"))
[1] "character"

Generate random numeric or logical vectors

sample(10)+100
 [1] 110 103 104 101 102 107 109 106 105 108
runif(10)>0.5
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE

Demonstrate vector arithmetic with vectors of different lengths

1:10 +1:2
 [1]  2  4  4  6  6  8  8 10 10 12
1:10+1:3
Warning: longer object length is not a multiple of shorter object length
 [1]  2  4  6  5  7  9  8 10 12 11

Create a tibble with two columns, ‘x’ and ‘y’, with different lengths

library(tidyverse)



tibble(
  x=1:4,
  y=rep(1:2,each=2)
)

20.4.4 Naming vectors

All types of vectors can be named. You can name them during creatin with c():

c(x=1,y=2,z=4)
x y z 
1 2 4 

Or after the fact with purr::set_names()

set_names(1:3,c("a","b","c"))
a b c 
1 2 3 

Named vectors are most useful for subsetting, described next.

20.4.5 Subsetting

Demonstrate subsetting vectors with integer indices

x <- c("one","two","three","four","five")
x[c(3,2,5)]
[1] "three" "two"   "five" 

By repeating a position, you can actually make a longer output than input:

x[c(1,1,5,5,5,2)]
[1] "one"  "one"  "five" "five" "five" "two" 

Negative values drop the elements at the specified positions:

x[c(-1,-3,-5)]
[1] "two"  "four"

The error message mentions subsetting with zero, which returns no values:

x[0]
character(0)
library(tidyverse)
x <- c(10,3,NA,5,8,1)

# tibble test
x <- as.tibble(x,ncol=1)
Warning: `as.tibble()` was deprecated in tibble 2.0.0.
Please use `as_tibble()` instead.
The signature and semantics have changed, see `?as_tibble`.
names(x)="v1"
is.na(x)
        v1
[1,] FALSE
[2,] FALSE
[3,]  TRUE
[4,] FALSE
[5,] FALSE
[6,] FALSE
x %>% filter(v1 == NA)

# all non-missing values of x
x <- c(10,3,NA,5,8,1)
x[!is.na(x)]
[1] 10  3  5  8  1
# all even (or missing) values of x
x[x %% 2==0]
[1] 10 NA  8
  1. If you have a named vector, you can subset it with a character vector:
x <- c(abc=1, def=2,xyz=5)
x[c("xyz","def")]
xyz def 
  5   2 

20.5 Recursive vectors (lists)

Create a list with numeric elements

x <- list(1,2,3)
x
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

Display the structure of lists with and without names

str(x)
List of 3
 $ : num 1
 $ : num 2
 $ : num 3
x_named <- list(a=1,b=2,c=3)
str(x_named)
List of 3
 $ a: num 1
 $ b: num 2
 $ c: num 3

Unlike atomic vectors, list() can contain a mix of objects:

y <- list("a",1L,1.5,T)
str(y)
List of 4
 $ : chr "a"
 $ : int 1
 $ : num 1.5
 $ : logi TRUE

List can even contain other lists!

z <- list(list(1,2),list(3,4))
str(z)
List of 2
 $ :List of 2
  ..$ : num 1
  ..$ : num 2
 $ :List of 2
  ..$ : num 3
  ..$ : num 4

20.5.1 Visualizing lists

x1 <- list(c(1,2),c(3,4))
x2 <- list(list(1,2),list(3,4))
x3 <- list(1,list(2,list(3)))
x1
[[1]]
[1] 1 2

[[2]]
[1] 3 4
x2
[[1]]
[[1]][[1]]
[1] 1

[[1]][[2]]
[1] 2


[[2]]
[[2]][[1]]
[1] 3

[[2]][[2]]
[1] 4
x3
[[1]]
[1] 1

[[2]]
[[2]][[1]]
[1] 2

[[2]][[2]]
[[2]][[2]][[1]]
[1] 3

20.5.2 Subsetting

Create a list ‘a’ with named elements and demonstrate subsetting

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a)
List of 4
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"
 $ c: num 3.14
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5
str(a[1:2])
List of 2
 $ a: int [1:3] 1 2 3
 $ b: chr "a string"
str(a[4])
List of 1
 $ d:List of 2
  ..$ : num -1
  ..$ : num -5

Demonstrate subsetting lists using double square brackets

str(a[[1]])
 int [1:3] 1 2 3
str(a[[4]])
List of 2
 $ : num -1
 $ : num -5

Access list elements by name using $ or [[ ]]

a$a
[1] 1 2 3
a[["a"]]
[1] 1 2 3

20.6 Attributes

Demonstrate setting and retrieving attributes of vectors

x <- 1:10
attr(x,"greeting")
NULL
attr(x,"greeting") <- "Hi!"
attr(x,"farewell") <- "Bye!"
attributes(x)
$greeting
[1] "Hi!"

$farewell
[1] "Bye!"

Demonstrate methods for class ‘Date’

as.Date
function (x, ...) 
UseMethod("as.Date")
<bytecode: 0x0000022ea4225378>
<environment: namespace:base>
methods("as.Date")
[1] as.Date.character   as.Date.default     as.Date.factor      as.Date.numeric     as.Date.POSIXct    
[6] as.Date.POSIXlt     as.Date.vctrs_sclr* as.Date.vctrs_vctr*
see '?methods' for accessing help and source code

Retrieve specific methods for ‘as.Date’

getS3method("as.Date","default")
function (x, ...) 
{
    if (inherits(x, "Date")) 
        x
    else if (is.null(x)) 
        .Date(numeric())
    else if (is.logical(x) && all(is.na(x))) 
        .Date(as.numeric(x))
    else stop(gettextf("do not know how to convert '%s' to class %s", 
        deparse1(substitute(x)), dQuote("Date")), domain = NA)
}
<bytecode: 0x0000022eb85b7c20>
<environment: namespace:base>
getS3method("as.Date","numeric")
function (x, origin, ...) 
if (missing(origin)) .Date(x) else as.Date(origin, ...) + x
<bytecode: 0x0000022ec2ac86e0>
<environment: namespace:base>

20.7.1 Factors

Demonstrate creating a factor and inspecting its attributes

x <- factor(c("ab","cd","ab"),levels=c("ab","cd","ed"))
typeof(x)
[1] "integer"
attributes(x)
$levels
[1] "ab" "cd" "ed"

$class
[1] "factor"

20.7.2 Dates and date-times

Dates in R are numeric vectors that represent the number of days since 1 January 1970.

x <- as.Date("1971-01-01")
unclass(x)
[1] 365
typeof(x)
[1] "double"
attributes(x)
$class
[1] "Date"

Demonstrate creating and inspecting a date-time object

x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
[1] 3600
attr(,"tzone")
[1] "UTC"
typeof(x)
[1] "double"
attributes(x)
$class
[1] "POSIXct" "POSIXt" 

$tzone
[1] "UTC"

Demonstrate setting and retrieving time zone for date-time object

attr(x,"tzone") <- "US/Pacific"
x
[1] "1969-12-31 17:00:00 PST"
attr(x,"tzone") <- "US/Eastern"
x
[1] "1969-12-31 20:00:00 EST"

There is another type of date-times called POSIXIt. There are built on top of named lists:

y <- as.POSIXlt(x)
typeof(y)
[1] "list"
#> [1] "list"
attributes(y)
$names
 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"   "isdst"  "zone"   "gmtoff"

$class
[1] "POSIXlt" "POSIXt" 

$tzone
[1] "US/Eastern" "EST"        "EDT"       

$balanced
[1] TRUE

20.7.3 Tibbles

Tibbles are augmented lists: they have class “tbl_df” + “tbl” + “data.frame”, and names (column) and row.names attributes:

tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
[1] "list"
attributes(tb)
$class
[1] "tbl_df"     "tbl"        "data.frame"

$row.names
[1] 1 2 3 4 5

$names
[1] "x" "y"
df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
[1] "list"
attributes(df)
$names
[1] "x" "y"

$class
[1] "data.frame"

$row.names
[1] 1 2 3 4 5

Chapter 21: Iteration

21.1.1 Prerequisites

library(tidyverse)

21.2 For loops

Imagine we have this simple tibble:

df <- tibble(
  a=rnorm(10),
  b=rnorm(10),
  c=rnorm(10),
  d=rnorm(10)
)

Calculate the median for each column in a tibble

median(df$a)
[1] -0.3157254
median(df$b)
[1] -0.8006407
median(df$c)
[1] -0.2668019
median(df$d)
[1] -0.02814063

Calculate the median for each column in the data frame ‘df’ using a for loop

df
output <- vector("double",ncol(df))
for (i in seq_along(df)){
  output[[i]] <- median(df[[i]])
}
output <- tibble(output)

Demonstrate the behavior of seq_along and length functions with an empty vector ‘y’

y <- vector("double", 0)
seq_along(y)
integer(0)
#> integer(0)
1:length(y)
[1] 1 0
#> [1] 1 0

21.3.1v Modifying an existing object

Sometimes, you want to use a for loop to modify an existing object. For example, remember our challenges from functions. We wanted to rescale every column in a data frame:

library(tidyverse)

df <- tibble(
  a=rnorm(10),
  b=rnorm(10),
  c=rnorm(10),
  d=rnorm(10)
)

rescale01 <- function(x){
  rng <- range(x,na.rm=T)
  (x-rng[1])/(rng[2]-rng[1])
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

df
for ( i in seq_along(df)){
  df[[i]] <- rescale01(df[[i]])
}

21.3.2 Looping patterns

x
[1] "1969-12-31 20:00:00 EST"
results <- vector("list",length(x))
names(results) <- names(x)

Demonstrate looping patterns using a for loop to iterate over a list ‘x’ and store results in a list ‘results’

for(i in seq_along(x)){
  name <- names(x)[[i]]
  value <- x[[i]]
}

21.3.3 Unknown output length

Create a vector ‘output’ with unknown length and store results from a for loop in it

means <- c(0,1,2)

output <- double()
for (i in seq_along(means)){
  n <- sample(100,1)
  output <- c(output,rnorm(n,means[[i]]))
}
str(output)
 num [1:223] 2.4083 1.5499 0.6081 0.0844 0.7443 ...
output
  [1]  2.40834490  1.54985893  0.60813582  0.08444820  0.74433326  0.23589873  0.13677913 -1.51138770
  [9] -1.27301392 -1.76413099 -0.63497070  1.54956856  1.50375944  0.71571312  0.57801330  0.33952611
 [17]  0.55112157  0.17114550 -0.45463725  1.16554397  0.69994812 -1.38517572 -0.21089332  0.59729886
 [25]  0.96649672  0.27565281  2.07450311  0.94767106 -1.19450592 -1.17615918 -0.06135937  0.31565475
 [33]  0.46863199 -2.44533524  1.06774440 -0.53263928  0.79354070 -1.03657232 -1.41232073 -1.32268012
 [41] -0.80619868 -1.48689463 -1.54482571  1.03872808 -1.69903338 -1.03393281  1.63922764  1.21681751
 [49] -1.50423215  0.08619177 -0.64176595 -0.43528690  2.35908464 -0.07057289  0.79716367  1.39285456
 [57]  1.99268652  1.96561675  4.02439840  2.43533373  1.71102597 -1.18110259  0.59191107  1.10348971
 [65]  0.73300412  0.12159976  0.27290089  1.61147640  0.05298932  0.28489074  0.51530326  2.48330465
 [73]  1.37412695  1.25174310  1.51181660  0.51833391  1.22731542  0.94751275  2.14534938  3.73603533
 [81]  2.50448575 -0.07692596  0.38360741  1.13672743  1.59343745  1.59795077  1.43992598  1.32878827
 [89]  4.19537063  1.71172058  1.10573970  0.60868714  1.07205788  2.30309804  0.89510610  1.53102920
 [97]  1.34831205  1.42881339  1.98257191  1.81541903  1.01326728  0.19392951  0.33627612  0.47158036
[105]  2.83906013  0.38940341  1.87692593  0.86561042  1.79769711  1.32415638  1.66865453  2.99697358
[113]  0.46228139  0.33591193  0.24631821 -0.43897896 -0.11946077  1.11951899  2.53562810  1.15236359
[121] -0.09132156 -1.16767499  0.69617045  1.53977593 -0.25021887  2.18171198  1.97165693 -0.23020504
[129] -0.44086089  1.15684137  2.24634334  1.55135055  2.31883196  0.67556263  0.08018167  1.13714558
[137]  1.91669266  1.11139134  2.44770118  0.19471109  2.17871196  4.23369053 -0.15692858  2.21704642
[145]  2.90250569 -0.47232133  1.53732998  0.24186549  2.47811777  3.33046579  2.23734182  0.96433477
[153]  2.53867102  2.31206058  1.97274207  2.13828221  2.57547609  1.13761332  1.18604257  1.61607018
[161]  2.37121047  2.84795864  3.27669950  1.60560488  2.10355804  3.13176141  3.26092255  0.75913241
[169]  0.30603003  2.52789744  0.60599780  2.73159671  0.64479899 -0.11114226  0.61712535  3.05991705
[177]  1.35211904  1.02472332  1.29546010  2.25444890  1.57803759  0.76376634 -0.73903589  2.93147906
[185]  2.07135777  1.17651876  1.52535341  1.73003413  0.57964605  1.54963744  2.16950772  1.45261534
[193]  2.78626480  2.75094794  2.99609585  2.54613765  1.71969902  4.22074008  2.97537992  1.47012235
[201]  1.27664953  2.99738151  2.29757845  1.24353502  2.34876874  0.40451196  0.98056608  2.15815198
[209]  1.21343698  2.37080467  1.56698860  3.19657275  0.47615986  0.53829438  3.06730806  1.80701539
[217]  2.43974635  1.12710760  2.72351352  2.57993773  3.26991511  1.74089217  0.42477812

Create a list ‘out’ with unknown length and store results from a for loop in it

out <- vector("list",length(means))
for (i in seq_along(means)){
  n <- sample(100,1)
  out[[i]] <- rnorm(n,means[[i]])
}
str(out)
List of 3
 $ : num [1:23] 0.109 0.669 -0.159 -0.325 -0.81 ...
 $ : num [1:97] 1.57 2.15 2.41 1.75 1.1 ...
 $ : num [1:9] 1.86 2.62 2.25 1.62 2.24 ...
str(unlist(out))
 num [1:129] 0.109 0.669 -0.159 -0.325 -0.81 ...

21.3.4 Unknown sequence length

A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can’t rewrite every while loop as for loop:

for (i in seq_along(x)) {
  # body
}

# Equivalent to
i <- 1
while (i <= length(x)) {
  # body
  i <- i + 1 
}

Herhow we could use a while loop to find how many tries it takes to get three heads in a row:

flip <- function() sample(c("T", "H"), 1)

flips <- 0
nheads <- 0

while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
[1] 26

21.4 For loops vs. functionals

Compare for loop and functional approaches for calculating column means in a data frame

df <- tibble(
  a=rnorm(10),
  b=rnorm(10),
  c=rnorm(10),
  d=rnorm(10)
)

Using for loop

output <- vector("double",length(df))
for (i in seq_along(df)){
  output[[i]] <- mean(df[[i]])
}
output
[1] 0.476473102 0.001854536 0.558698854 0.220409290

Using functional approach with a custom function ‘col_mean’

col_mean <- function(df){
  output <- vector("double",length(df))
  for (i in seq_along(df)){
    output[i] <- mean(df[[i]])
  }
  output
}

Define a function ‘col_median’ to calculate the median for each column in the data frame ‘df’

col_median <- function(df){
  output <- vector("double",hh(df))
  for (i in seq_along(df)){
    output[i] <- median(df[[i]])
  }
  output
}

col_sd <- function(df){
  output <- vector("double",length(df))
  for (i in seq_along(df)){
    output[i] <- sd(df[[i]])
  }
  output
}

df

Define functions f1, f2, and f3 for calculating different powers of absolute deviation from the mean

f1 <- function(x) abs(x-mean(x))^1
f2 <- function(x) abs(x-mean(x))^2
f3 <- function(x) abs(x-mean(x))^3

Define a function ‘f’ to calculate the absolute deviation from the mean raised to a given power ‘i’

f <- function(x,i) abs(x-mean(x))^i

Define a function ‘col_summary’ to apply a summary function ‘fun’ to each column of the data frame ‘df’

col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)
[1]  0.47327175 -0.06728873  0.35193999  0.23340748
col_summary(df, mean)
[1] 0.476473102 0.001854536 0.558698854 0.220409290

Demonstrate the use of ‘map_dbl’ from the ‘purrr’ package to apply a function to each column of the data frame ‘df’

library(purrr)
head(df)


# Reference - for loop()
output <- vector("double",length(df))
for (i in seq_along(df)){
  output[[i]] <- mean(df[[i]])
}
output
[1] 0.476473102 0.001854536 0.558698854 0.220409290
map_dbl(df,mean)
          a           b           c           d 
0.476473102 0.001854536 0.558698854 0.220409290 
map_dbl(df,median)
          a           b           c           d 
 0.47327175 -0.06728873  0.35193999  0.23340748 
map_dbl(df,sd)
        a         b         c         d 
1.1903576 0.9296956 0.7214505 0.6619177 
df %>% map_dbl(mean)
          a           b           c           d 
0.476473102 0.001854536 0.558698854 0.220409290 
df %>% map_dbl(median)
          a           b           c           d 
 0.47327175 -0.06728873  0.35193999  0.23340748 
df %>% map_dbl(sd)
        a         b         c         d 
1.1903576 0.9296956 0.7214505 0.6619177 

Demonstrate the use of ‘map_dbl’ from the ‘purrr’ package with additional arguments

map_dbl(df,mean,trim=0.5)
          a           b           c           d 
 0.47327175 -0.06728873  0.35193999  0.23340748 

Demonstrate the use of ‘map_int’ from the ‘purrr’ package to apply a function that returns integers to each element of a list

z <- list(x=1:3,y=4:5)
z
$x
[1] 1 2 3

$y
[1] 4 5
map_int(z,length)
x y 
3 2 

21.5.1 Shortcuts

Demonstrate the use of ‘safely’ from the ‘purrr’ package to create a safe version of a function

safe_log <- safely(log)
str(safe_log(10))
List of 2
 $ result: num 2.3
 $ error : NULL
str(safe_log("a"))
List of 2
 $ result: NULL
 $ error :List of 2
  ..$ message: chr "non-numeric argument to mathematical function"
  ..$ call   : language .Primitive("log")(x, base)
  ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

Demonstrate the use of ‘map’ from the ‘purrr’ package with ‘safely’ to apply a safe version of a function to each element of a list

x <- list(1,10,"a")
y <- x %>% map(safely(log))
str(y)
List of 3
 $ :List of 2
  ..$ result: num 0
  ..$ error : NULL
 $ :List of 2
  ..$ result: num 2.3
  ..$ error : NULL
 $ :List of 2
  ..$ result: NULL
  ..$ error :List of 2
  .. ..$ message: chr "non-numeric argument to mathematical function"
  .. ..$ call   : language .Primitive("log")(x, base)
  .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

Demonstrate the use of ‘transpose’ from the ‘purrr’ package to transpose a list of lists

y <- x %>% transpose()
str(y)
List of 1
 $ :List of 3
  ..$ : num 1
  ..$ : num 10
  ..$ : chr "a"

Demonstrate the use of error handling with ‘map_lgl’ and ‘is_null’ from the ‘purrr’ package

is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
list()
# y$result[is_ok] %>% flatten_dbl()

Purrr provides two other useful adverbs:

x <- list(1,10,"a")
x %>% map_dbl(possibly(log,NA_real_))
[1] 0.000000 2.302585       NA

Demonstrate the use of ‘quietly’ from the ‘purrr’ package to suppress errors and return results with warnings

x <- list(1,-1)
x %>% map(quietly(log)) %>% str()
List of 2
 $ :List of 4
  ..$ result  : num 0
  ..$ output  : chr ""
  ..$ warnings: chr(0) 
  ..$ messages: chr(0) 
 $ :List of 4
  ..$ result  : num NaN
  ..$ output  : chr ""
  ..$ warnings: chr "NaNs produced"
  ..$ messages: chr(0) 

21.7 Mapping over multiple arguments

Generate random numbers from normal distributions with different means using ‘map’ from the ‘purrr’ package

mu <- list(5,10,-3)
mu %>% 
  map(rnorm,n=5) %>% 
  str()
List of 3
 $ : num [1:5] 5.11 4.32 5.23 5.09 4.36
 $ : num [1:5] 9.66 9.56 12.07 9.6 7.39
 $ : num [1:5] -2.57 -3.32 -4.2 -2.16 -2.32

Generate random numbers from normal distributions with different means and standard deviations using ‘map2’ from the ‘purrr’ package

sigma <- list(1,5,10)
seq_along(mu) %>% 
  map(~rnorm(5,mu[[.]],sigma[[.]])) %>% 
  str()
List of 3
 $ : num [1:5] 5.36 7.16 4.32 4.53 5.97
 $ : num [1:5] 21.09 8.26 15.76 8.76 8.33
 $ : num [1:5] -11.56 -11 -4.91 -10.91 -13.74

Define a custom ‘map2’ function to apply a binary function to corresponding elements of two lists

map2(mu,sigma,rnorm,n=5) %>% str()
List of 3
 $ : num [1:5] 4.85 4.27 4.85 3.68 3.97
 $ : num [1:5] 15.4 16.3 13.5 11.2 11.4
 $ : num [1:5] 4.561 -0.279 -11.559 0.298 -5.152
map2 <- function(x,y,f,...){
  out <- vector("list",length(x))
  for (i in seq_along(x)){
    out[[i]] <- f(x[[i]],y[[i]],...)
  }
  out
}

Apply a function to corresponding elements of multiple lists using ‘pmap’ from the ‘purrr’ package


library(magrittr)
library(purrr)

n <- list(1,3,5)
args1 <- list(n,mu,sigma)
args1 %>% 
  pmap(rnorm) %>% 
  str()
List of 3
 $ : num 3.71
 $ : num [1:3] 6.11 12.73 9.31
 $ : num [1:5] -12.562 -1.359 -0.823 -20.731 -12.676

Apply a function to corresponding elements of multiple lists with named parameters using ‘pmap’ from the ‘purrr’ package

args2 <- list(mean=mu, sd=sigma,n=n)
args2 %>% 
  pmap(rnorm) %>% 
  str()
List of 3
 $ : num 4.34
 $ : num [1:3] 9.78 8.19 11.21
 $ : num [1:5] 0.742 6.791 14.991 -1.91 0.511

Apply a function to corresponding rows of a data frame using ‘pmap’ from the ‘purrr’ package with a tibble

library(tidyverse)
parms <- tribble(
  ~mean,~sd,~n,
  5,1,1,
  10,5,3,
  -3,10,5
)

parms %>% 
  pmap(rnorm)
[[1]]
[1] 3.243648

[[2]]
[1]  9.605242 11.025883  6.007252

[[3]]
[1]  16.669768   5.855557 -12.841618  -3.360844   5.139707

21.7.1 Involing different functions

Invoke different functions with different parameters using ‘invoke_map’ from the ‘purrr’ package

f <- c("runif","rnorm","rpois")
param <- list(
  list(min=-1,max=1),
  list(sd=5),
  list(lambda=10)
)

f
[1] "runif" "rnorm" "rpois"
param
[[1]]
[[1]]$min
[1] -1

[[1]]$max
[1] 1


[[2]]
[[2]]$sd
[1] 5


[[3]]
[[3]]$lambda
[1] 10

To handle this case, you can use invoke_map():

invoke_map(f,param,n=5) %>% 
  str()
Warning: `invoke_map()` was deprecated in purrr 1.0.0.
Please use map() + exec() instead.
List of 3
 $ : num [1:5] 0.7386 -0.9649 -0.0834 -0.4986 -0.9628
 $ : num [1:5] 6.447 -6.531 -0.428 6.091 8.155
 $ : int [1:5] 9 11 9 9 11

Invoke different functions with different parameters using ‘pmap’ from the ‘purrr’ package and a tibble

sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)
sim %>% 
  mutate(sim = invoke_map(f, params, n = 10))

21.8 Walk

Perform side effects without returning a value for each element of a list using ‘walk’ from the ‘purrr’ package

x <- list(1,"a",3)
x %>% 
  walk(print)
[1] 1
[1] "a"
[1] 3

Perform side effects on each element of a list using ‘walk’ from the ‘purrr’ package, then save the results

library(ggplot2)
plots <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- stringr::str_c(names(plots), ".pdf")

pwalk(list(paths, plots), ggsave, path = tempdir())
Saving 7 x 7 in image

Retain or remove elements of a list based on a predicate function using ‘keep’ and ‘discard’ from the ‘purrr’ package

iris %>% 
  keep(is.factor) %>% 
  str()
'data.frame':   150 obs. of  1 variable:
 $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
iris %>% 
  discard(is.factor) %>%
  str()
'data.frame':   150 obs. of  4 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
library(tidyverse)
library(magrittr)

21.9.2 Reduce and accumulate

Iteratively combine elements of a list using a binary function with ‘reduce’ from the ‘purrr’ package

dfs <- list(
  age=tibble(name="John",age=30),
  sex=tibble(name=c("John","Mary"),sex=c("M","F")),
  trt=tibble(name="Mary",treatment="A")
)

dfs %>% reduce(full_join)
Joining with `by = join_by(name)`Joining with `by = join_by(name)`

Find the intersection of multiple vectors using ‘reduce’ from the ‘purrr’ package

vs <- list(
  c(1,3,5,6,10),
  c(1,2,3,7,8,10),
  c(1,2,3,4,8,9,10)
)
vs %>% reduce(intersect)
[1]  1  3 10

Iteratively apply a function to elements of a list using ‘accumulate’ from the ‘purrr’ package

x <- sample(10)
x
 [1]  7  3  4  2  8  9  1  5 10  6
x %>% accumulate(`+`)
 [1]  7 10 14 16 24 33 34 39 49 55
---
title: "Introduction to R"
author: "Bibek Sapkota"
output:
  pdf_document: default
  html_notebook: default
---

# Tibbles

Task 1:Loading the tidyverse package.
```{r}
library(tidyverse)
```
Task 2:Converting the iris dataset to a tibble.
```{r}
as_tibble(iris)
```
Task 3: Creating a tibble with columns "x," "y," and "z," where "x" ranges from 1 to 5, "y" is 1 for all rows, and "z" is calculated as the square of "x" plus "y" for each row.
```{r}
tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)
```

Task 4:Creating a tibble with columns named ":)" (representing "smile"), " " (representing "space"), and "2000" (representing "number").
```{r}
tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb
```

Task 5:Creating a tibble with columns "x," "y," and "z," containing the values "a," 2, 3.6 and "b," 1, 8.5 respectively.

```{r}
tribble(
  ~x, ~y, ~z,
  
  "a", 2, 3.6,
  "b", 1, 8.5
)
```

# Tibbles vs. data.frame

Task-1:Creating a tibble with columns "a," "b," "c," "d," and "e," containing 1000 randomly generated values for each column, representing dates, numbers, and letters.

```{r}
tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
```
Task 2: Tnstalling the package
```{r}
package_to_install <- c("nycflights13")

for (package_name in package_to_install) {
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)
  }
}
library(nycflights13)
```

Task 3: Printing the first 10 rows of the nycflights13::flights dataset with unlimited width.
```{r}
nycflights13::flights %>% 
  print(n = 10, width = Inf)
```


Task 4: Viewing the nycflights13::flights dataset in a separate window for interactive exploration.
```{r}
nycflights13::flights %>% 
  View()
```

## Subsetting

Task 1: Creating a tibble named "df" with columns "x" and "y," then accessing the "x" column using different methods:
```{r}
df <- tibble(
  x = runif(5),#function that generates random numbers from a uniform distribution
  y = rnorm(5) # function that generates random numbers from a normal (Gaussian) distribution
)

df$x

df[["x"]]

df[[1]]

df %>% .$x


```
## Interacting with older code
Task-1: Determining the class of the object "tb" after converting it to a data frame.

```{r}
class(as.data.frame(tb))
```
## Exercises
Task-1: How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).
```{r}
mtcars
```
Task-2
```{r}
# In a data.frame, extracting a non-existent column returns NULL,
# whereas in a tibble, it raises an error, providing immediate feedback.
# Other operations, such as extracting existing columns and subsets of columns,
# behave similarly across both data frames and tibbles.
# The default behavior of data.frames may lead to frustration
# due to the lack of error feedback for non-existent columns,
# potentially causing unnoticed mistakes and difficulty in debugging.
# In contrast, tibbles offer more robust behavior, enhancing data integrity
# and debugging efficiency.

df <- data.frame(abc = 1, xyz = "a")

# Extracting non-existent column in a data.frame
df$x  # Returns NULL

# Extracting existing column in a data.frame
df[, "xyz"]  # Returns a data frame with one column containing the values of the "xyz" column

# Extracting multiple columns in a data.frame
df[, c("abc", "xyz")]  # Returns a data frame containing only the specified columns

```
Task-3:If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?


# No pacakages
```{r}
# heights <- read_csv("data/heights.csv")
```

Task 1:  listing several tables: table1, table2, table3, table4a, and table4b.
```{r}
table1
table2
table3
table4a
table4b
```
Task 2: Calculating the rate by dividing the number of cases by the population and then multiplying by 10,000 for table1.
```{r}
table1 %>% 
  mutate(rate = cases / population * 10000)
```
Task 3: Counting the occurrences of each year in table1, using the 'cases' column as the weight.
```{r}
table1 %>% 
  count(year, wt = cases)
```
Task 4: Creating a ggplot using table1, plotting 'year' against 'cases' with lines grouped by 'country' and colored in grey50, along with points colored by 'country'.

```{r}
library(ggplot2)
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))
```
#  Pivoting
## Longer
Task-1: referring to 'table4a'
```{r}
table4a
```
Task-2: Reshaping table4a using pivot_longer for columns '1999' and '2000' into 'year' and 'cases'.
```{r}
table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
```
Task-3: Reshaping table4b with pivot_longer for columns '1999' and '2000' into 'year' and 'population'.
```{r}
table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")  #function transforms wide data into long format by stacking multiple columns into two: one for variable names and one for their corresponding values
```
Task-4: creating tidy datasets tidy4a and tidy4b by using pivot_longer on table4a and table4b to reshape them. Then, performing a left join on tidy4a and tidy4b.
```{r}
tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)
```
## Wider
Task-1:Displaying table 2
```{r}
table2
```

Task-2: using the pivot_wider function on table2 to transform it from long to wide format, with 'type' becoming the new column names and 'count' being the corresponding values.
```{r}
table2 %>%
    pivot_wider(names_from = type, values_from = count)
```
###########
# Separating and uniting
## Separate
Task-1:displaying table3
```{r}
 table3
```

Task-2: Using the separate function on table3 splits the 'rate' column into two separate columns named 'cases' and 'population'.
```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"))
```
Task-3:Using the separate function on table3 splits the 'rate' column into two separate columns named 'cases' and 'population', using the '/' character as the separator.
```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")
```
Task-4:Using the separate function on table3 splits the 'rate' column into two separate columns named 'cases' and 'population', converting the resulting columns to their appropriate data types.
```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
```
Task-5:  Applying the separate function to table3, the 'year' column is divided into two separate columns labeled 'century' and 'year', with the separator defined as the second character.
```{r}
table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)
```
## Unite

Task-1: The unite function is applied to table5 to merge the 'century' and 'year' columns into a single column named 'new'.
```{r}
table5 %>% 
  unite(new, century, year)
```
Task-2: unite function is applied to table5 to merge the 'century' and 'year' columns into a single column named 'new', with no separator between them.
```{r}
table5 %>% 
  unite(new, century, year, sep = "")
```
#  Missing values
Task-1: Create a tibble named "stocks" with columns "year", "qtr" (quarter), and "return", having data for 2015 and 2016, with quarterly returns specified and some missing entries as NA.
```{r}
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
```


Task-2:Pivoting the "stocks" tibble to widen the data, extracting columns from the "year" variable and values from the "return" variable.
```{r}
stocks %>% 
  pivot_wider(names_from = year, values_from = return)
```

Task-3: pivot the data to a wide format with columns for each year's returns, then reshape it back to a long format, keeping only the non-missing values in the "return" column.
```{r}
stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(
    cols = c(`2015`, `2016`), 
    names_to = "year", 
    values_to = "return", 
    values_drop_na = TRUE
  )
```
Task-4:Filling missing combinations of "year" and "qtr" in the "stocks" dataset.
```{r}
stocks %>% 
  complete(year, qtr)
```
Task-5:Creating a tibble named "treatment" containing information about individuals, their treatment groups, and their responses, with some missing values for the "person" column.
```{r}
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
```
Task-6: Filling the missing values in the "person" column of the "treatment" tibble.
```{r}
treatment %>% 
  fill(person)

```

# Case Study
Task-1: Loading data set
```{r}
who
```

Task-2:Pivoting the "who" dataset from wide to long format, condensing columns into "cases" and capturing the original column names in "key".
```{r}
who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
```
Task-3:Counting the occurrences of each "key" in the "who1" dataset.
```{r}
  who1 %>% 
    count(key)
```
Task-4:Replacing "newrel" with "new_rel" in the "key" column of the "who1" dataset to create "who2."
```{r}
who2 <- who1 %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
who2
```
Task-5:Separating the "key" column in the "who2" dataset into "new," "type," and "sexage" columns using "_" as the separator to create "who3."
```{r}
who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep = "_")
who3
```
Task-6:Counting the occurrences of each unique value in the "new" column of the "who3" dataset.
```{r}
who3 %>% 
  count(new)
```

Task-7:Removing the "new", "iso2", and "iso3" columns from the "who3" dataset and assigning the result to "who4".
```{r}
who4 <- who3 %>% 
  select(-new, -iso2, -iso3)
```

Task-8:Splitting the "sexage" column of the "who4" dataset into "sex" and "age" columns, separated by the first character, and assigning the result to "who5".
```{r}
who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who5
```
Task-9:Transforming the "who" dataset from wide to long format, adjusting column names, extracting meaningful variables, dropping unnecessary columns, and splitting the "sexage" column into "sex" and "age".
```{r}
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)

```


## CH-13: Relational data

Task-1:Loding the libraries
```{r}
library(tidyverse)
library(nycflights13)
```

## nycflights13
Task-1: airlines data
```{r}
airlines
```

Task-2: airports data
```{r}
airports
```
Task-3: planes data
```{r}
planes 
```
Task-4: weather data
```{r}
weather 
```
################################################################################

# Keys 
Task-1Counting the occurrences of each tail number in the "planes" table and filtering for those with more than one occurrence.
```{r}
planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
```

Task-2:Counting the occurrences of each combination of year, month, day, hour, and origin in the "weather" table and filtering for those with more than one occurrence.
```{r}
weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)
```
Task-3:Counting the occurrences of each combination of year, month, day, and flight in the "flights" table and filtering for those with more than one occurrence.
```{r}
flights %>% 
  count(year, month, day, flight) %>% 
  filter(n > 1)
```
Task-4:Counting the occurrences of each combination of year, month, day, and tail number in the "flights" table and filtering for those with more than one occurrence.
```{r}
flights %>% 
  count(year, month, day, tailnum) %>% 
  filter(n > 1)
```
# Mutating joins

Task-1: Creating a subset of the "flights" table named "flights2" containing columns from "year" to "day", "hour", "origin", "dest", "tailnum", and "carrier".
```{r}
flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2
```
Task-2:Removing the "origin" and "dest" columns from "flights2" table and then performing a left join with the "airlines" table, using the "carrier" column as the key for matching.
```{r}
flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier")
```
Task-3:Shortening the command by removing "selecting" and directly "mutating" the "name" column with the corresponding airline names from the "airlines" table based on the "carrier" column.
```{r}
flights2 %>%
  select(-origin, -dest) %>% 
  mutate(name = airlines$name[match(carrier, airlines$carrier)])
```
#  Understanding joins
Task-1:Creating two tibbles, "x" and "y", each with a "key" column and an associated "val_x" or "val_y" column, respectively.
```{r}
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

x
y
```

## Inner join
Task-1:Joining tibbles `x` and `y` using an inner join operation based on the "key" column.
```{r}
x %>% 
  inner_join(y, by = "key")
```
## Duplicate keys
Task-1: Joining tibble x with tibble y using the common column "key".
```{r}
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
```

Task-2:Performing a left join between tibble `x` and tibble `y` based on the common column "key".
```{r}
left_join(x, y, by = "key")
```

Task-3:Creating two tibbles, `x` and `y`, with columns "key", "val_x", and "val_y", populated with corresponding values.
```{r}
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
```

Task-4:Performing a left join on tibbles `x` and `y` using the "key" column as the join key.
```{r}
left_join(x, y, by = "key")
```
# Defining the key columns
Task-1:Performing a left join between the `flights2` tibble and the `weather` tibble.
```{r}
flights2 %>% 
  left_join(weather)
```
Task-2:Performing a left join between the `flights2` tibble and the `planes` tibble using the "tailnum" column as the key.
```{r}
flights2 %>% 
  left_join(planes, by = "tailnum")
```
Task-3:Performing a left join between the `flights2` tibble and the `airports` tibble, matching the "dest" column from `flights2` with the "faa" column from `airports`.
```{r}
flights2 %>% 
  left_join(airports, c("dest" = "faa"))
```

Task-4:Performing a left join between the `flights2` tibble and the `airports` tibble, matching the "origin" column from `flights2` with the "faa" column from `airports`.
```{r}
flights2 %>% 
  left_join(airports, c("origin" = "faa"))
```
#  Filtering joins
Task-1: Calculating the top 10 destinations by counting the occurrences in the "dest" column of the `flights` tibble, sorted in descending order, and then displaying the result.
```{r}
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest
```
Task-2: Filtering the `flights` tibble to include only rows where the destination (`dest`) matches any of the top 10 destinations identified in the previous step.
```{r}
flights %>% 
  filter(dest %in% top_dest$dest)
#%in% operator in R is used to check if elements in one vector are present in another vector
```

Task-3: Selecting rows from the `flights` dataset where the destination airport matches one of the top 10 destinations previously identified.
```{r}
flights %>% 
  semi_join(top_dest)
```
Task-4: Filtering out flights with tail numbers present in the planes dataset and counting the occurrences of each unique tail number, sorting the result.
```{r}
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
```
# Set operations
Task-1:creating two tibbles, df1 and df2, each with columns x and y, containing sample data.
```{r}
df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)
```

Task-2:performing set operations on the tibbles df1 and df2, including intersection, union, and set differences.
```{r}
intersect(df1, df2)
union(df1, df2)
setdiff(df1, df2)
setdiff(df2, df1)
```

# CH-14: Strings
Basic Info:string1 <- "This is a string"
           string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
           
Task-1:To include a literal single or double quote in a string you can use \ to “escape” it         
```{r}
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```

Task-2: Understanding the character 
```{r}

x <- c("\"", "\\") #backslash is escape character
x
writeLines(x)
```
#  String length
Task-1:
```{r}
str_length(c("a", "R for data science", NA))
```

# Combining strings
Task-1:Combining  the strings
```{r}
str_c("x", "y")
str_c("x", "y", "z")
```
Task-2:Using the sep argument to control how they’re separated.
```{r}
str_c("x", "y", sep = ", ")
```
Task-3:Performing concatenation with "|" and "-" at both ends of each element of vector x, and replacing NA values with empty strings before concatenation.
```{r}
x <- c("abc", NA)
str_c("|-", x, "-|")
str_c("|-", str_replace_na(x), "-|")
```
Task-4: concatenating each element of the vector c("a", "b", "c") with a prefix "prefix-" and a suffix "-suffix".
```{r}
str_c("prefix-", c("a", "b", "c"), "-suffix")
```
Task-5: combining strings
```{r}
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE

str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
```
# Subsetting strings
Task-1:Extracting the first three characters from each element in the vector `x` using `str_sub`.
```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
```
Task-2:negative numbers count backwards from end
```{r}
str_sub(x, -3, -1)
```
Task-3:using the assignment form of str_sub() to modify strings
```{r}
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
```

# Locales
Task-1:Changing the case 
```{r}
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
Task-2:Sorting the character vector x alphabetically using the English (en) locale and the Hawaiian (haw) locale.
```{r}
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en") 
str_sort(x, locale = "haw") 
```
#  Matching patterns with regular expressions

## Basic matches
Task-1:Searching for the pattern "an" within each element of `x` and displaying the matches.
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "an")
```

Task-2:Displaying elements in `x` where any character is followed by "a" and then any character.
```{r}
str_view(x, ".a.")
```
Task-3 
```{r}
# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")

```
Task-4: Displaying elements in `x` where the sequence "\\" occurs.
```{r}
x <- "a\\b"
writeLines(x)

str_view(x, "\\\\")
```
##  Anchors
Task-1: Displaying elements in `x` that start with "a" and end with "a" respectively.
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
```
Task-2: Highlighting "apple" occurrences in `x` and instances where it's the only content.
```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
```
## Character classes and alternatives

Task-1: Visualizing patterns matching "a.c", "a*c", and "a c" in the provided character vector.
```{r}
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
```
Task-2: Visualizing patterns matching "grey" or "gray" in the provided character vector.
```{r}
str_view(c("grey", "gray"), "gr(e|a)y")
```
## Repetition
Task-1:Identifying patterns "CC" or "C" in the string "1888 is the longest year in Roman numerals
```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
```
Task-2: Viewing the pattern "CC"
```{r}
str_view(x, "CC+")
```
Task-3: Viewing the pattern "C[LX]+"
```{r}
str_view(x, 'C[LX]+')
```
Task-4:Viewing the pattern "C{2},C{2,},c{2,3}"
```{r}
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
```
## Grouping and backreferences
Task-1:Grouping
```{r}
str_view(fruit, "(..)\\1", match = TRUE)
```

## Detect matches
Task-1: Checking for the presence of the letter "e" in each word 
```{r}
x <- c("apple", "banana", "pear")
str_detect(x, "e")
```

Task-2:Checking how many common words start with t
```{r}
sum(str_detect(words, "^t"))
```

Task-3: Checking proportion of common words end with a vowel
```{r}
mean(str_detect(words, "[aeiou]$"))
```
Task-4:Finding all words containing at least one vowel, and negate
```{r}
no_vowels_1 <- !str_detect(words, "[aeiou]")
```

Task-5:Finding all words consisting only of consonants (non-vowels)
```{r}
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
```

Task-6: Filtering words that end with the letter "x" from a list of words.
```{r}
words[str_detect(words, "x$")]
str_subset(words, "x$")
```
Task-7: Filtering a tibble for words that end with "x".
```{r}
df <- tibble(
  word = words, 
  i = seq_along(word)
)
df %>% 
  filter(str_detect(word, "x$"))
```

Task-8:Counting the occurrences of "a" in each element of a character vector.
```{r}
x <- c("apple", "banana", "pear")
str_count(x, "a")
```
Task-9: Seeing average of how many vowels per word
```{r}
mean(str_count(words, "[aeiou]"))
```
Task-10: Adding columns to a tibble to count vowels and consonants in each word.
```{r}
df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
```

Task-11:Counting "aba" occurrences in "abababa" and showing all "aba" instances.
```{r}
str_count("abababa", "aba")
str_view_all("abababa", "aba")
```
## Extract matches
Task-1: Displaying the length of sentences and showing the first few sentences.
```{r}
length(sentences)
head(sentences)
```

Task-2: Creating a string pattern to match colors by concatenating them with a pipe delimiter.
```{r}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
```
Task-3: Filter sentences for colors and extract matches, showing the first few.
```{r}
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
```
Task-4:Showing all sentences containing multiple colors and highlight the matches.
```{r}
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
```
Task-5:Extracting all color matches from the subset of sentences containing multiple colors.
```{r}
str_extract(more, colour_match)
```

Task-6:Extracting all occurrences of colors from the subset of sentences containing multiple colors.
```{r}
str_extract_all(more, colour_match)
```
Task-7: Extracting colors from sentences with multiple colors and simplify, also extract lowercase letters from each element in x and simplify.
```{r}
str_extract_all(more, colour_match, simplify = TRUE)
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```
## Grouped matches 
Task-1: Extracting sentences containing nouns defined by a pattern, then extracts the nouns from those sentences.
```{r}
noun <- "(a|the) ([^ ]+)"

has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>% 
  str_extract(noun)
```

Task-2:
```{r}
has_noun %>% 
  str_match(noun)
```
Task-3:Creating a tibble with columns 'article' and 'noun' extracted from sentences based on a pattern.
```{r}
tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE
  )
```
## Replacing matches
Task-1: Replacing the first vowel in each word of x with a hyphen.
        Replacing all vowels in each word of x with a hyphen.
```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
str_replace_all(x, "[aeiou]", "-")
```
Task-2: Replacing numeric values in x with their corresponding word representations.
```{r}
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
Task-3:Reordering words in sentences by swapping the second and third word positions.
```{r}
sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)
```
# Splitting
Task-1: Splitting the first five sentences into words.
```{r}
sentences %>%
  head(5) %>% 
  str_split(" ")
```
Task-2:Splitting the string 'a|b|c|d' by '|' into a vector of elements.
```{r}
"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]]
```
Task-3:Splitting the first 5 sentences by space into a matrix of words.
```{r}
sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
```
Task-4:Splitting each field string into two parts at the first occurrence of ': '.
```{r}
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
```
Task-5: Display word boundaries, split by spaces, and split by word boundaries, respectively.
```{r}
x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))
str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]]
```
# Other types of pattern

Task-1: 
```{r}
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
```
Task-2:Visualizing occurrences of "banana" in different case variations.
```{r}
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))
```
Task-3: Extracting all lines starting with "Line" from the text.
```{r}
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
```
Task-4: Extracting all occurrences of lines starting with "Line" from the text, considering each line separately.
```{r}
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
```
Task-5:Creating a regular expression pattern for phone numbers, allowing for variations in formatting, and attempting to match it against the provided phone number.
```{r}
phone <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [) -]?   # optional closing parens, space, or dash
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{3}) # three more numbers
  ", comments = TRUE)

str_match("514-791-8141", phone)
```
Task-6:Installling the package and Benchmarking string detection in "sentences" using fixed and regex patterns 20 times each, comparing performance with microbenchmark.
```{r}

package_to_install <- c("microbenchmark")

for (package_name in package_to_install) {
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)
  }
}
library(microbenchmark)

microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times = 20
  )
```
Task-7:Starting with a1 being "\u00e1" and a2 being "a\u0301", both representing the character "á", they are compared for equality.
```{r}
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
a1 == a2
```
Task-8: Checking if `a1` contains the fixed string `a2` returns `FALSE`, whereas using collation rules returns `TRUE`.
```{r}
str_detect(a1, fixed(a2))

str_detect(a1, coll(a2))
```
Task-9:Creating a vector `i` with different forms of the letter "i", then using `str_subset` to filter them based on collation.
```{r}
i <- c("I", "İ", "i", "ı")
i
str_subset(i, coll("i", ignore_case = TRUE))
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
```

Task-10: Fetching locale information.
```{r}
stringi::stri_locale_info()

```
Task-11:Visualizing word boundaries and extracts all words from the string.
```{r}
x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
```
# CH-15: Factors
## Creatig factors
Task-1:Adding character vector in variable x1
```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
```

Task-2:Adding character vector in variable x2
```{r}
x2 <- c("Dec", "Apr", "Jam", "Mar")
```

Task-3:Sorting X1 
```{r}
sort(x1)

```
Task-4:Adding Character vector in month_levels
```{r}
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```

Task-5:Assigning the factor levels to the variable x1, using the predefined month_levels.
```{r}
y1 <- factor(x1, levels = month_levels)
y
```
Task-6:Sorting the factor levels in y1.
```{r}
sort(y1)
```
Task-7:creating a factor y2 from x2 with custom levels specified by month_levels.
```{r}
y2 <- factor(x2, levels = month_levels)
y2
```
Task-8:parsing the values in x2 as factors
```{r}
y2 <- parse_factor(x2, levels = month_levels)
```
Task-9: omitting the levels.
```{r}
factor(x1)
```
Task-10:Creating a factor f1 from the values in x1, using the unique values of x1 as levels.
```{r}
f1 <- factor(x1, levels = unique(x1))
f1
```
Task-11: creating a factor f2 from the values in x1, ordering them according to their appearance in x1.
```{r}
f2 <- x1 %>% factor() %>% fct_inorder()
f2
```
Task-12:Omitting levels2
```{r}
levels(f2)
```
# General Social Survey
Task-1:Loading datasets
```{r}
gss_cat
```

Task-2:Seeing levels through count()
```{r}
gss_cat %>%
  count(race)
```
Task-3:Also seeing through bar()
```{r}
ggplot(gss_cat, aes(race)) +
  geom_bar()
```

Task-4:Generating a bar plot using ggplot()
```{r}
ggplot(gss_cat,aes(race))+geom_bar()+scale_x_discrete(drop=FALSE)
```
# Modifying factor order
Task-1:calculating summary statistics and then creating scatter plot 
```{r}
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
```

Task-2:Generating a scatter plot using `ggplot`, where the x-axis represents the mean TV hours (`tvhours`), and the y-axis represents the `relig` variable reordered by mean TV hours.
```{r}
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
```
Task-3:Creating a scatter plot using ggplot.
```{r}
relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()
```
Task-4:Generating a scatter plot using ggplot
```{r}
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
```
Task-5: creates a scatter plot of the average age by income level, with "Not applicable" as the reference level for income
```{r}
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
```
Task-6:calculating the proportion of each marital status group across different age groups and creates a line plot showing the distribution of marital status proportions by age.
```{r}
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
```
Task-7: Adjusting the order of the "marital" variable based on frequency and then reverses the order before generating a bar plot illustrating the distribution of marital status.
```{r}
gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
    geom_bar()
```
# Modifying factor levels

Task-1: counting the frequency of each unique value in the "partyid" variable of the "gss_cat" dataset.
```{r}
gss_cat%>%count(partyid)
```
Task-2:Recording the levels of the "partyid" variable in the "gss_cat" dataset and then counts the frequency of each unique recorded value.
```{r}
gss_cat %>%
  mutate( partyid=fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
    ))%>%
  count(partyid)
```
Task-3:Recategorizing and counting party affiliations in the "gss_cat" dataset.
```{r}
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
```
Task-4: Collapsing categories within the "partyid" variable in the "gss_cat" dataset into broader groups and then counting the frequency of each collapsed category.
```{r}
gss_cat%>%
  mutate(partyid=fct_collapse(partyid,
                              other=c("No answer", "Don't know", "Other party"),
                              rep=c("Strong republican", "Not str republican"),
                              ind=c("Ind,near rep", "Independent", "Ind,near dem"),
                              dem=c("Not str democrat", "Strong democrat"))) %>%
  count(partyid)
```

Task-5:Counting and aggregating religious affiliations in the "gss_cat" dataset after lumping together less frequent categories.
```{r}
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
```
Task-6:"Summarizing religious affiliations after lumping infrequent categories and sort."
```{r}
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
```

# CH-Data and Times

Task-1:Loading library
```{r}
library(tidyverse)

library(lubridate)
library(nycflights13)
```

## Creating dates/times
Task-1: Printing  current date or date-time
```{r}
today()
now()
```
## Form strings
Task-2:converting date strings to date objects in different formats.
```{r}
ymd("2017-01-31")
mdy("January 31st, 2017")
dmy("31-Jan-2017")
```
```{r}
ymd(20170131)
```

```{r}
ymd_hms("2017-01-31 20:11:59")
mdy_hm("01/31/2017 08:01")
```


```{r}
flights %>% 
  select(year, month, day, hour, minute)
flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))
```
Task: Creating date-time objects from hour-minute time data in the 'flights' dataset and filtering out rows with missing departure or arrival times
```{r}
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt
```
Task: Plotting the frequency of flights over time using departure date-time

```{r}
flights_dt %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) 
```
Task: Plotting the frequency of flights over time for a specific period using departure date-time

```{r}
flights_dt %>% 
  filter(dep_time < ymd(20130102)) %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
```
Task: to convert today's date to date-time object
```{r}
as_datetime(today())

as_date(now())

as_date(365 * 10 + 2)

```
Date-time components
Task: Extracting various components of a date-time object
```{r}
datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
month(datetime)
mday(datetime)
yday(datetime)
wday(datetime)
month(datetime, label = TRUE)
wday(datetime, label = TRUE, abbr = FALSE)
```
Task: Plotting the frequency of flights by day of the week
```{r}
flights_dt %>% 
  mutate(wday = wday(dep_time, label = TRUE)) %>% 
  ggplot(aes(x = wday)) +
    geom_bar()
```
Task: Plotting average delay by minute of departure time
```{r}
flights_dt %>% 
  mutate(minute = minute(dep_time)) %>% 
  group_by(minute) %>% 
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n()) %>% 
  ggplot(aes(minute, avg_delay)) +
    geom_line()
```
Task: Plotting average delay by minute of scheduled departure time

```{r}
sched_dep <- flights_dt %>% 
  mutate(minute = minute(sched_dep_time)) %>% 
  group_by(minute) %>% 
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n())

ggplot(sched_dep, aes(minute, avg_delay)) +
  geom_line()
```
Task: Plotting the number of flights by minute of scheduled departure time
```{r}
ggplot(sched_dep, aes(minute, n)) +
  geom_line()
```
Rounding
Task:Plotting the number of flights by week, rounding to the nearest week

```{r}
flights_dt %>% 
  count(week = floor_date(dep_time, "week")) %>% 
  ggplot(aes(week, n)) +
    geom_line()

```
setting compounds
Task: Setting up a date-time object
```{r}
(datetime <- ymd_hms("2016-07-08 12:34:56"))
year(datetime) <- 2020
datetime
month(datetime) <- 01
datetime
hour(datetime) <- hour(datetime) + 1
datetime
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
```
```{r}
ymd("2015-02-01") %>% 
  update(mday = 30)
ymd("2015-02-01") %>% 
  update(hour = 400)
```
Task: Creating a new variable 'dep_hour' by updating the 'dep_time' to the first day of the year

```{r}
flights_dt %>% 
  mutate(dep_hour = update(dep_time, yday = 1)) %>% 
  ggplot(aes(dep_hour)) +
    geom_freqpoly(binwidth = 300)
```
Time Spans
Compute the age of a person based on their birthdate and today's date
```{r}
h_age <- today() - ymd(19791014)
h_age
as.duration(h_age)
```
```{r}
dseconds(15)
dminutes(10)
dhours(c(12, 24))
ddays(0:5)
dweeks(3)
dyears(1)
```
```{r}
2 * dyears(1)
dyears(1) + dweeks(12) + dhours(15)
tomorrow <- today() + ddays(1)
last_year <- today() - dyears(1)
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
one_pm
one_pm + ddays(1)
```
Periods
Create period objects representing different time spans and Perform arithmetic operations with period objects
```{r}
one_pm
one_om = days(1)
```
```{r}
seconds(15)
minutes(10)
hours(c(12, 24))
days(7)
months(1:6)
weeks(3)
years(1)
```

```{r}
10 * (months(6) + days(1))
days(50) + hours(25) + minutes(2)
```

```{r}
ymd("2016-01-01") + dyears(1)
ymd("2016-01-01") + years(1)
one_pm + ddays(1)
one_pm + days(1)
```

Filter flights where arrival time is before departure time
```{r}
flights_dt %>% 
  filter(arr_time < dep_time) 
```

Update flights data to correct overnight flights
```{r}
flights_dt <- flights_dt %>% 
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )
```

Filter flights where overnight condition is true and arrival time is before departure time
```{r}
flights_dt %>% 
  filter(overnight, arr_time < dep_time) 
```

Intervals
Calculate the ratio of one year in days
```{r}
years(1) / days(1)
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
(today() %--% next_year) %/% days(1)
```

Display time zone information
```{r}
Sys.timezone()
length(OlsonNames())
head(OlsonNames())
```
```{r}
(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
```
```{r}
x1 - x2
x1 - x3
```
# Pipes

Task: To import the required library
```{r}
packages_to_install <- c("tidyverse", "pryr")
for (package_name in packages_to_install) {
  if (!requireNamespace(package_name, quietly = TRUE)) {
    install.packages(package_name)
  }
  library(package_name, character.only = TRUE)
}

library(magrittr)
```

Create diamond data and calculate the object sizes
```{r}
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% 
  dplyr::mutate(price_per_carat = price / carat)

pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds, diamonds2)
```
Functions
Normalize the columns of a data frame
```{r}
df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
  (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
```
Normalize a single column of a data frame
```{r}
(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```
```{r}
x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```
```{r}
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
```
```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
```

Rescale a vector to the range [0, 1]
```{r}
rescale01(c(-10, 0, 10))
rescale01(c(1, 2, 3, NA, 5))
```

Rescale each column of a DataFrame to the range [0, 1]
```{r}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```
```{r}
x <- c(1:10, Inf)
rescale01(x)
```

Define the rescale01 function and apply it
```{r}
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
```

Load required libraries and packages
```{r}
library(tidyverse)
library(purrr)
library(magrittr)

# install.packages("pryr")
library(pryr)
```

## 18.2 Piping alternatives
This is a popular Children’s poem that is accompanied by hand actions.We’ll start by defining an object to represent little bunny Foo Foo:

```{r 18.2-1}
# foo_foo <- little_bunny()
```
### 18.2.1 Intermediate steps
The simplest approach is to save each step as a new object:

```{r 18.2.1-1}
# foo_foo_1 <- hop(foo_foo,through=forest)
# foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
# foo_foo_3 <- bop(foo_foo_2, on = head)
```

Create diamonds dataset and calculate price per carat
```{r 18.2.1-2}
diamonds <- ggplot2::diamonds
diamonds2 <- diamonds %>% 
  dplyr::mutate(price_per_carat=price/carat)

pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds,diamonds2)
```

Introduce NA value into diamonds$carat and check object sizes
```{r 18.2.1-3}
diamonds$carat[1] <- NA
pryr::object_size(diamonds)
pryr::object_size(diamonds2)
pryr::object_size(diamonds,diamonds2)
```

### 18.2.2 Overwrite the original
Instead of creating intermediate objects at each step, we could overwrite the original object:
```{r 18.2.2-1}
# foo_foo <- hop(foo_foo, through = forest)
# foo_foo <- scoop(foo_foo, up = field_mice)
# foo_foo <- bop(foo_foo, on = head)
```
### 18.2.3 Function composition
Another approach is to abandon assignment and just string the function calls together:
```{r 18.2.3-1}
# bop(
#   scoop(
#     hop(foo_foo, through = forest),
#     up = field_mice
#   ), 
#   on = head
# )
```

Here the disadvantage is that you have to read from inside-out, from right-to-left, and that the arguments end up spread far apart (evocatively called the dagwood sandwhich problem). In short, this code is hard for a human to consume.

### 18.2.4 Use the pipe
Finally, we can use the pipe:
```{r 18.2.4-1}
# foo_foo %>%
#   hop(through = forest) %>%
#   scoop(up = field_mice) %>%
#   bop(on = head)
```

```{r 18.2.4-2}
# my_pipe <- function(.) {
#   . <- hop(., through = forest)
#   . <- scoop(., up = field_mice)
#   bop(., on = head)
# }
# my_pipe(foo_foo)
```
TASK:  Functions that use the current environment. For example, `assign()` will create a new variable with the given name in the current environment:
```{r 18.2.4-3}
assign("x",10)
x

"x" %>% assign(100)
x
```
Assign value to "x" in the specified environment and check its value and Generate random numbers, create a matrix, plot it, and inspect its structure
```{r 18.2.4-4}
env <- environment()
"x" %>% assign(100,envir=env)
x
```
```{r 18.4-1}
rnorm(100) %>% 
  matrix(ncol=2) %>% 
  plot() %>% 
  str()

rnorm(100) %>% 
  matrix(ncol=2) %>% 
  plot() %>% 
  str()

ndist <- rnorm(100000)
hist(ndist)
```

Calculate the correlation between two variables in mtcars dataset
```{r 18.4-2}
mtcars %$%
  cor(disp, mpg)
```

- For assignment magrittr provides the `%<>%` operator which allows you to replace code like:
```{r 18.4-3}
mtcars <- mtcars %>% 
  transform(cyl=cyl*2)
```


```{r 18.4-4}
mtcars %<>% transform(cyl=cyl*2)
```
# Chapter 19 Functions
## 19.1 Introduction

## 19.2 When should you write a function?
```{r 19.2-1}
df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df

df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
  (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
```

Rescale a single variable in a data frame
```{r 19.2-2}
(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```

Rescale a single variable without creating a new object
```{r}
x <- df$a
(x - min(x, na.rm = T)) / (max(x, na.rm = T)-min(x, na.rm = T))
```

Task: There is some duplication in this code. We’re computing the range of the data three times, so it makes sense to do it in one step:
```{r}
rng <- range(x, na.rm = T)
(x-rng[1])/(rng[2]-rng[1])
```

Pulling out intermediate calculations into named variables is a good practice because it makes it more clear what the code is doing. Now that I’ve simplified the code, and checked that it still works, I can turn it into a function:
```{r 19.2-5}
rescale01 <- function(x){
  rng <- range(x, na.rm = T)
  (x-rng[1])/(rng[2]-rng[1])
}
rescale01(c(0,5,10))
```

Test the rescale01 function with various inputs
```{r 19.2-6}
rescale01(c(-10,0,10))
rescale01(c(1,2,3,NA,5))
```

We can simplify the original example now that we have a function:
```{r 19.2-7}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```

Rescale a vector with infinite values
```{r 19.2-8}
x <- c(1:10,Inf)
rescale01(x)
```

Because we’ve extracted the code into a function, we only need to make the fix in one place:
```{r 19.2-9}
rescale01 <- function(x){
  rng <- range(x,na.rm=T,finite=T)
  (x-rng[1])/(rng[2]-rng[1])
}
rescale01(x)
```

## 19.4 Conditional execution
An `if` statement allows you to conditionally execute code. 
It looks like this:
```{r}
# if (condition) {
  # code executed when condition is TRUE
# } else {
  # code executed when condition is FALSE
# }
```

Define a function to check if an object has names
```{r}
has_name <- function(x){
  nms <- names(x)
  if(is.null(nms)){
    rep(FALSE,length(x))
  }else {
    !is.na(nms) & nms !=""
  }
}
```
### 19.4.1 Conditions
how if condition works with warnings
```{r 19.4.1-1}
# if (c(TRUE,FALSE)){}
#> Warning in if (c(TRUE, FALSE)) {: the condition has length > 1 and only the
#> first element will be used
#> NULL

# if (NA) {}
```
Check if two objects are identical
```{r}
identical(0L,0)
x <- sqrt(2)^2
x==2
x-2
```

### 19.4.2 Multiple conditions
You can chain multiple if statement together:
```{r 19.4.2-1}
# if (this) {
#   # do that
# } else if (that) {
#   # do something else
# } else {
#   # 
# }
```

```{r 19.4.2-2}
#> function(x, y, op) {
#>   switch(op,
#>     plus = x + y,
#>     minus = x - y,
#>     times = x * y,
#>     divide = x / y,
#>     stop("Unknown op!")
#>   )
#> }
```
### 19.4.3 Code style

Good practice for writing if statements
```{r 19.4.3-1}
# Good
# if (y < 0 && debug) {
#   message("Y is negative")
# }
# 
# if (y == 0) {
#   log(x)
# } else {
#   y ^ x
# }
# 
# # Bad
# if (y < 0 && debug)
# message("Y is negative")
# 
# if (y == 0) {
#   log(x)
# } 
# else {
#   y ^ x
# }
```

It’s ok to drop the curly braces if you have a very short if statement that can fit on one line: 
```{r 19.4.3-2}
y <- 10
x <- if (y < 20) "Too low" else "Too high"
```

I recommend this only for very brief `if` statements. Otherwise, the full form is easier to read:
```{r}
if (y < 20) {
  x <- "Too low" 
} else {
  x <- "Too high"
}
```

## 19.5 Function arguments 

```{r}
# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

x <- runif(100)
mean_ci(x)

mean_ci(x, conf = 0.99)

```


### 19.5.1 Choosing names
### 19.5.2 Cheking values
```{r}
wt_mean <- function(x, w) {
  sum(x * w) / sum(w)
}
wt_var <- function(x, w) {
  mu <- wt_mean(x, w)
  sum(w * (x - mu) ^ 2) / sum(w)
}
wt_sd <- function(x, w) {
  sqrt(wt_var(x, w))
}
```

What happens if x and w are not the same length?
```{r}
wt_mean(1:6, 1:3)

```

In this case, because of R’s vector recycling rules, we don’t get an error.

It’s good practice to check important preconditions, and throw an error (with `stop()`), if they are not true:
```{r}
wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(w * x) / sum(w)
}
```

```{r}
wt_mean <- function(x, w, na.rm = FALSE) {
  if (!is.logical(na.rm)) {
    stop("`na.rm` must be logical")
  }
  if (length(na.rm) != 1) {
    stop("`na.rm` must be length 1")
  }
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
```

This is a lot of extra work for little additional gain. A useful compromise is the built-in `stopifnot()`: it checks that each argument is `TRUE`, and produces a generic error message if not.
```{r}
wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
```
### 19.5.3 Dot-dot-dot(...)
Many functions in R take an arbitrary number of inputs: 
```{r 19.5.3-1}
sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
stringr::str_c("a", "b", "c", "d", "e", "f")
```

Define a function to concatenate strings with commas
```{r 19.5.3-2}
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])


rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
```
```{r}
x <- c(1,2)
sum(x,na.rm=T)
```

Define a function 'complicated_function' with conditions to return 0 if 'x' or 'y' is empty
```{r}
complicated_function <- function(x,y,z){
  if (lenth(x)==0 || length(y)==0){
    return(0)
  }
}
```

Improve readability of if-else blocks by using early return for simple cases
```{r}
f <- function() {
  if (x) {
    # Do 
    # something
    # that
    # takes
    # many
    # lines
    # to
    # express
  } else {
    # return something short
  }
}
```

But if the first block is very long, by the time you get to the else, you’ve forgotten the condition. One way to rewrite it is to use an early return for the simple case:
```{r 19.6.1-3}
f <- function() {
  if (!x) {
    return(something_short)
  }

  # Do 
  # something
  # that
  # takes
  # many
  # lines
  # to
  # express
}
```

This tends to make the code easier to understand, because you don't need quite so much context to understand it.

### 19.6.2 Writing pipeable functions
Define a function to show the count of missing values in a data frame
```{r 19.6.2-1}
show_missing <- function(df){
  n <- sum(is.na(df))
  cat("Missing values:",n,"\n",sep="")
  
  invisible(df)
}
```

If we call it interatively, the `invisible()` means that the input `df` does not get printed out: 
```{r}
show_missing(mtcars)
```

But it's still there, it's not just printed by default:
```{r}
x <- show_missing(mtcars)
class(x)
dim(x)
```

And we can still use it in a pipe:
```{r 19.6.2-4}
library(magrittr)
library(tidyverse)

mtcars %>% 
  show_missing() %>% 
  mutate(mpg=ifelse(mpg<20,NA,mpg)) %>% 
  show_missing()
```

## 19.7 Environment
Define a function 'f' that takes an argument 'x' and returns the sum of 'x' and 'y'
```{r 19.7-1}
f <- function(x){
  x+y
}
```
Demonstrate how changing the value of 'y' affects the result of calling function 'f'
```{r 19.7-2}
y <- 100
f(10)
y <- 1000
f(10)
```
Overload the '+' operator to behave differently based on a random condition
```{r 19.7-3}
`+` <- function(x, y) {
  if (runif(1) < 0.1) {
    sum(x, y)
  } else {
    sum(x, y) * 1.1
  }
}
table(replicate(1000, 1 + 2))
#> 
#>   3 3.3 
#> 100 900
rm(`+`)
```
# Chapter 20: Vectors 

### 20.1.1 PRerequisites 

```{r}
library(tidyverse)
```

## 20.2 Vector basics 
Determine the data type of different vectors
```{r}
typeof(letters)
typeof(1:10)
```

Determine the length of a list and display its contents
```{r}
x <- list("a","b",1:10)
length(x)
x
```
Demonstrate modulo operation and creation of logical vectors
```{r}
1:10 %% 3 ==0
c(T,T,F,NA)
```

### 20.3.2 Numeric
Integer and double vectors are known collectively as numeric vectors. In R, numbers are doubles by default. To make an integer, place an L after the number:
```{r 20.3.2-1}
typeof(1)
typeof(1L)
1.5
```
Demonstrate the behavior of floating point arithmetic
```{r 20.3.2-2}
x <- sqrt(2)^2
x
x-2

```
Demonstrate the behavior of division by zero
```{r 20.3.2-3}
c(-1,0,1)%/% 0
# [1] -Inf  NaN  Inf
```



### 20.3.3 Character
Determine the memory size of a string and a replicated string vector
```{r 20.3.3}
x <- "This is a reasonably long string."
pryr::object_size(x)

y <- rep(x,1000)
pryr::object_size(y)
```


### 20.3.4 Missing values
Note that each type of atomic vector has its own missing value:
```{r}
NA            # logical

NA_integer_   # integer

NA_real_      # double

NA_character_ # character

```
Calculate the number and proportion of elements in a vector greater than 10
```{r 20.4.1-1}
x <- sample(20,100,replace=T)
y <- x > 10
sum(y) # how many are greater than 10?
mean(y) # what proportion are greater than 10?
```
```{r 20.4.1-2}
if (length(x)){
}
```
Determine the data type of different vectors
```{r 20.4.1-3}
typeof(c(TRUE,1L))
typeof(c(1L,1.5))
typeof(c(1.5,"a"))
```
Generate random numeric or logical vectors
```{r}
sample(10)+100
runif(10)>0.5
```
Demonstrate vector arithmetic with vectors of different lengths
```{r}
1:10 +1:2
```

```{r}
1:10+1:3
```
Create a tibble with two columns, 'x' and 'y', with different lengths
```{r}
library(tidyverse)



tibble(
  x=1:4,
  y=rep(1:2,each=2)
)
```

#### 20.4.4 Naming vectors
All types of vectors can be named. You can name them during creatin with `c()`:
```{r}
c(x=1,y=2,z=4)
```

Or after the fact with `purr::set_names()`
```{r}
set_names(1:3,c("a","b","c"))
```

Named vectors are most useful for subsetting, described next.

### 20.4.5 Subsetting
Demonstrate subsetting vectors with integer indices
```{r}
x <- c("one","two","three","four","five")
x[c(3,2,5)]
```

By repeating a position, you can actually make a longer output than input:
```{r}
x[c(1,1,5,5,5,2)]
```

Negative values drop the elements at the specified positions:
```{r}
x[c(-1,-3,-5)]
```
The error message mentions subsetting with zero, which returns no values:
```{r}
x[0]
```
```{r}
library(tidyverse)
x <- c(10,3,NA,5,8,1)

# tibble test
x <- as.tibble(x,ncol=1)
names(x)="v1"
is.na(x)
x %>% filter(v1 == NA)

# all non-missing values of x
x <- c(10,3,NA,5,8,1)
x[!is.na(x)]

# all even (or missing) values of x
x[x %% 2==0]
```

3. If you have a named vector, you can subset it with a character vector:
```{r}
x <- c(abc=1, def=2,xyz=5)
x[c("xyz","def")]
```

## 20.5 Recursive vectors (lists)
Create a list with numeric elements
```{r 20.5-1}
x <- list(1,2,3)
x
```
Display the structure of lists with and without names
```{r 20.5-2}
str(x)
x_named <- list(a=1,b=2,c=3)
str(x_named)
```

Unlike atomic vectors, `list()` can contain a mix of objects:
```{r 20.5-3}
y <- list("a",1L,1.5,T)
str(y)
```

List can even contain other lists!
```{r}
z <- list(list(1,2),list(3,4))
str(z)
```

### 20.5.1 Visualizing lists 
```{r 20.5.1}
x1 <- list(c(1,2),c(3,4))
x2 <- list(list(1,2),list(3,4))
x3 <- list(1,list(2,list(3)))
x1
x2
x3
```
### 20.5.2 Subsetting
Create a list 'a' with named elements and demonstrate subsetting
```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```

```{r 20.5.2-2}
str(a)
str(a[1:2])
str(a[4])
```
Demonstrate subsetting lists using double square brackets
```{r 20.5.2-3}
str(a[[1]])
str(a[[4]])
```
Access list elements by name using $ or [[ ]]
```{r}
a$a
a[["a"]]
```
## 20.6 Attributes
Demonstrate setting and retrieving attributes of vectors
```{r 20.6-1}
x <- 1:10
attr(x,"greeting")

attr(x,"greeting") <- "Hi!"
attr(x,"farewell") <- "Bye!"
attributes(x)
```
Demonstrate methods for class 'Date'
```{r}
as.Date
```
```{r 20.6-3}
methods("as.Date")
```
Retrieve specific methods for 'as.Date'
```{r}
getS3method("as.Date","default")
getS3method("as.Date","numeric")
```
### 20.7.1 Factors 
Demonstrate creating a factor and inspecting its attributes
```{r}
x <- factor(c("ab","cd","ab"),levels=c("ab","cd","ed"))
typeof(x)
attributes(x)
```

### 20.7.2 Dates and date-times
Dates in R are numeric vectors that represent the number of days since 1 January 1970.
```{r 20.7.2-1}
x <- as.Date("1971-01-01")
unclass(x)

typeof(x)
attributes(x)
```
Demonstrate creating and inspecting a date-time object
```{r 20.7.2-2}
x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)

typeof(x)
attributes(x)
```
Demonstrate setting and retrieving time zone for date-time object
```{r 20.7.2-3}
attr(x,"tzone") <- "US/Pacific"
x

attr(x,"tzone") <- "US/Eastern"
x
```

There is another type of date-times called POSIXIt. There are built on top of named lists:
```{r}
y <- as.POSIXlt(x)
typeof(y)
#> [1] "list"
attributes(y)
```

### 20.7.3 Tibbles
Tibbles are augmented lists: they have class “tbl_df” + “tbl” + “data.frame”, and `names` (column) and `row.names` attributes:
```{r 20.7.3-1}
tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
attributes(tb)

```
```{r 20.7.3-2}
df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
attributes(df)
```
# Chapter 21: Iteration
### 21.1.1 Prerequisites
```{r}
library(tidyverse)
```

## 21.2 For loops
Imagine we have this simple tibble:
```{r}
df <- tibble(
  a=rnorm(10),
  b=rnorm(10),
  c=rnorm(10),
  d=rnorm(10)
)
```
Calculate the median for each column in a tibble 
```{r}
median(df$a)
median(df$b)
median(df$c)
median(df$d)
```
Calculate the median for each column in the data frame 'df' using a for loop
```{r}
df
output <- vector("double",ncol(df))
for (i in seq_along(df)){
  output[[i]] <- median(df[[i]])
}
output <- tibble(output)
```
Demonstrate the behavior of seq_along and length functions with an empty vector 'y'
```{r}
y <- vector("double", 0)
seq_along(y)
#> integer(0)
1:length(y)
#> [1] 1 0
```
### 21.3.1v Modifying an existing object
Sometimes, you want to use a for loop to modify an existing object. For example, remember our challenges from functions. We wanted to rescale every column in a data frame:
```{r}
library(tidyverse)

df <- tibble(
  a=rnorm(10),
  b=rnorm(10),
  c=rnorm(10),
  d=rnorm(10)
)

rescale01 <- function(x){
  rng <- range(x,na.rm=T)
  (x-rng[1])/(rng[2]-rng[1])
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

df
```

```{r}
for ( i in seq_along(df)){
  df[[i]] <- rescale01(df[[i]])
}
```

### 21.3.2 Looping patterns
```{r}
x
results <- vector("list",length(x))
names(results) <- names(x)
```

Demonstrate looping patterns using a for loop to iterate over a list 'x' and store results in a list 'results'
```{r}
for(i in seq_along(x)){
  name <- names(x)[[i]]
  value <- x[[i]]
}
```

### 21.3.3 Unknown output length
Create a vector 'output' with unknown length and store results from a for loop in it
```{r}
means <- c(0,1,2)

output <- double()
for (i in seq_along(means)){
  n <- sample(100,1)
  output <- c(output,rnorm(n,means[[i]]))
}
str(output)
output
```

Create a list 'out' with unknown length and store results from a for loop in it
```{r}
out <- vector("list",length(means))
for (i in seq_along(means)){
  n <- sample(100,1)
  out[[i]] <- rnorm(n,means[[i]])
}
str(out)
str(unlist(out))
```
### 21.3.4 Unknown sequence length


A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can't rewrite every while loop as for loop:
```{r}
for (i in seq_along(x)) {
  # body
}

# Equivalent to
i <- 1
while (i <= length(x)) {
  # body
  i <- i + 1 
}

```

Herhow we could use a while loop to find how many tries it takes to get three heads in a row: 
```{r}
flip <- function() sample(c("T", "H"), 1)

flips <- 0
nheads <- 0

while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
```
## 21.4 For loops vs. functionals
Compare for loop and functional approaches for calculating column means in a data frame
```{r}
df <- tibble(
  a=rnorm(10),
  b=rnorm(10),
  c=rnorm(10),
  d=rnorm(10)
)
```
Using for loop
```{r}
output <- vector("double",length(df))
for (i in seq_along(df)){
  output[[i]] <- mean(df[[i]])
}
output
```
Using functional approach with a custom function 'col_mean'
```{r}
col_mean <- function(df){
  output <- vector("double",length(df))
  for (i in seq_along(df)){
    output[i] <- mean(df[[i]])
  }
  output
}
```
Define a function 'col_median' to calculate the median for each column in the data frame 'df'
```{r}
col_median <- function(df){
  output <- vector("double",hh(df))
  for (i in seq_along(df)){
    output[i] <- median(df[[i]])
  }
  output
}

col_sd <- function(df){
  output <- vector("double",length(df))
  for (i in seq_along(df)){
    output[i] <- sd(df[[i]])
  }
  output
}

df
```
Define functions f1, f2, and f3 for calculating different powers of absolute deviation from the mean
```{r}
f1 <- function(x) abs(x-mean(x))^1
f2 <- function(x) abs(x-mean(x))^2
f3 <- function(x) abs(x-mean(x))^3
```
Define a function 'f' to calculate the absolute deviation from the mean raised to a given power 'i'
```{r}
f <- function(x,i) abs(x-mean(x))^i
```
Define a function 'col_summary' to apply a summary function 'fun' to each column of the data frame 'df'
```{r}
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)
col_summary(df, mean)
```
Demonstrate the use of 'map_dbl' from the 'purrr' package to apply a function to each column of the data frame 'df'
```{r}
library(purrr)
head(df)


# Reference - for loop()
output <- vector("double",length(df))
for (i in seq_along(df)){
  output[[i]] <- mean(df[[i]])
}
output

map_dbl(df,mean)
map_dbl(df,median)
map_dbl(df,sd)
```

```{r}
df %>% map_dbl(mean)
df %>% map_dbl(median)
df %>% map_dbl(sd)
```
Demonstrate the use of 'map_dbl' from the 'purrr' package with additional arguments
```{r}
map_dbl(df,mean,trim=0.5)
```
Demonstrate the use of 'map_int' from the 'purrr' package to apply a function that returns integers to each element of a list
```{r}
z <- list(x=1:3,y=4:5)
z

map_int(z,length)
```

### 21.5.1 Shortcuts 
Demonstrate the use of 'safely' from the 'purrr' package to create a safe version of a function

```{r}
safe_log <- safely(log)
str(safe_log(10))
str(safe_log("a"))
```
Demonstrate the use of 'map' from the 'purrr' package with 'safely' to apply a safe version of a function to each element of a list

```{r}
x <- list(1,10,"a")
y <- x %>% map(safely(log))
str(y)

```
Demonstrate the use of 'transpose' from the 'purrr' package to transpose a list of lists
```{r}
y <- x %>% transpose()
str(y)
```
Demonstrate the use of error handling with 'map_lgl' and 'is_null' from the 'purrr' package
```{r}
is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
# y$result[is_ok] %>% flatten_dbl()
```

Purrr provides two other useful adverbs:
```{r}
x <- list(1,10,"a")
x %>% map_dbl(possibly(log,NA_real_))
```
Demonstrate the use of 'quietly' from the 'purrr' package to suppress errors and return results with warnings
```{r}
x <- list(1,-1)
x %>% map(quietly(log)) %>% str()
```


## 21.7 Mapping over multiple arguments 
Generate random numbers from normal distributions with different means using 'map' from the 'purrr' package
```{r}
mu <- list(5,10,-3)
mu %>% 
  map(rnorm,n=5) %>% 
  str()
```
Generate random numbers from normal distributions with different means and standard deviations using 'map2' from the 'purrr' package
```{r}
sigma <- list(1,5,10)
seq_along(mu) %>% 
  map(~rnorm(5,mu[[.]],sigma[[.]])) %>% 
  str()
```
Define a custom 'map2' function to apply a binary function to corresponding elements of two lists
```{r}
map2(mu,sigma,rnorm,n=5) %>% str()
```

```{r}
map2 <- function(x,y,f,...){
  out <- vector("list",length(x))
  for (i in seq_along(x)){
    out[[i]] <- f(x[[i]],y[[i]],...)
  }
  out
}
```
Apply a function to corresponding elements of multiple lists using 'pmap' from the 'purrr' package
```{r}

library(magrittr)
library(purrr)

n <- list(1,3,5)
args1 <- list(n,mu,sigma)
args1 %>% 
  pmap(rnorm) %>% 
  str()
```
Apply a function to corresponding elements of multiple lists with named parameters using 'pmap' from the 'purrr' package
```{r}
args2 <- list(mean=mu, sd=sigma,n=n)
args2 %>% 
  pmap(rnorm) %>% 
  str()
```
Apply a function to corresponding rows of a data frame using 'pmap' from the 'purrr' package with a tibble
```{r}
library(tidyverse)
parms <- tribble(
  ~mean,~sd,~n,
  5,1,1,
  10,5,3,
  -3,10,5
)

parms %>% 
  pmap(rnorm)
```
### 21.7.1 Involing different functions
Invoke different functions with different parameters using 'invoke_map' from the 'purrr' package
```{r}
f <- c("runif","rnorm","rpois")
param <- list(
  list(min=-1,max=1),
  list(sd=5),
  list(lambda=10)
)

f
param
```

To handle this case, you can use `invoke_map()`:
```{r}
invoke_map(f,param,n=5) %>% 
  str()
```
Invoke different functions with different parameters using 'pmap' from the 'purrr' package and a tibble
```{r}
sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)
sim %>% 
  mutate(sim = invoke_map(f, params, n = 10))
```

## 21.8 Walk
Perform side effects without returning a value for each element of a list using 'walk' from the 'purrr' package
```{r}
x <- list(1,"a",3)
x %>% 
  walk(print)
```
Perform side effects on each element of a list using 'walk' from the 'purrr' package, then save the results
```{r}
library(ggplot2)
plots <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- stringr::str_c(names(plots), ".pdf")

pwalk(list(paths, plots), ggsave, path = tempdir())
```
Retain or remove elements of a list based on a predicate function using 'keep' and 'discard' from the 'purrr' package
```{r}
iris %>% 
  keep(is.factor) %>% 
  str()

iris %>% 
  discard(is.factor) %>%
  str()
```



```{r}
library(tidyverse)
library(magrittr)


```

### 21.9.2 Reduce and accumulate
Iteratively combine elements of a list using a binary function with 'reduce' from the 'purrr' package
```{r}
dfs <- list(
  age=tibble(name="John",age=30),
  sex=tibble(name=c("John","Mary"),sex=c("M","F")),
  trt=tibble(name="Mary",treatment="A")
)

dfs %>% reduce(full_join)
```
Find the intersection of multiple vectors using 'reduce' from the 'purrr' package
```{r}
vs <- list(
  c(1,3,5,6,10),
  c(1,2,3,7,8,10),
  c(1,2,3,4,8,9,10)
)
vs %>% reduce(intersect)
```
Iteratively apply a function to elements of a list using 'accumulate' from the 'purrr' package
```{r}
x <- sample(10)
x
x %>% accumulate(`+`)
```

